Commit 7a89464: Sync with upstream (#9)
* Update generate_until_template_yaml (EleutherAI#1546)

* Update ifeval.yaml (EleutherAI#1506)

* add Arabic EXAMS benchmark (EleutherAI#1498)

* add Arabic EXAMS benchmark

* fixed the linter issue and added more information to the README

* Update README.md

---------

Co-authored-by: Lintang Sutawika <lintang@sutawika.com>

* AGIEval (EleutherAI#1359)

* add agieval

* fix typo

* add cloze / math exactmatch agieval tasks, rename

* update exact-match agieval tasks, allow for multiple-correct answers

* add more detail to readme

* don't parse_math_answer twice

---------

Co-authored-by: Alex Bäuerle <alex@a13x.io>

* cli_evaluate calls simple_evaluate with the same verbosity. (EleutherAI#1563)

* add manual tqdm disabling management (EleutherAI#1569)

* add manual tqdm disabling management

* add typing to all new args

* apply precommit changes

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Fix README section on vllm integration (EleutherAI#1579)

* Link to vllm integration

* add pip install .[vllm] cmd

* Fix Jinja template for Advanced AI Risk (EleutherAI#1587)

* Proposed approach for testing CLI arg parsing (EleutherAI#1566)

* New tests for CLI args

* fix spacing

* change tests for parsing

* add tests, fix parser

* remove defaults for store_true

* Patch for Seq2Seq Model predictions (EleutherAI#1584)

* Differentiate _encode_pair setting for decoder and enc-dec models

* tok_decode to not skip special tokens so that eos doesn't become an empty string

* Update model.py

* Update model.py

* Update huggingface.py

* Update lm_eval/models/huggingface.py

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update model.py

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Add start date in results.json (EleutherAI#1592)

* Cleanup for v0.4.2 release (EleutherAI#1573)

* Update interface.md

* fix: make caching reqs always work with accelerate launch

* remove stale task migration checklist

* remove deprecation warnings

* make informative TypeErrors for get_task_dict

* bump version metadata

* fix num_fewshot printing bug

* add fewshot value to cache key

* Fix eval_logger import for mmlu/_generate_configs.py (EleutherAI#1593)

* Fix eval_logger import for mmlu/_generate_configs.py

* linter

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* use BOS token in loglikelihood (EleutherAI#1588)

* use BOS token in loglikelihood

* improve comments

* add model arg

* log prefix token id

* log prefix token id

* Update lm_eval/api/model.py

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* change name to prefix_token_id

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Revert "Patch for Seq2Seq Model predictions (EleutherAI#1584)" (EleutherAI#1601)

This reverts commit b7923a8.

* fix gen_kwargs arg reading (EleutherAI#1607)

* fix until arg processing (EleutherAI#1608)

* Fixes to Loglikelihood prefix token / VLLM (EleutherAI#1611)

* make vllm use prefix_token_id ; have prefix_token_id be optional method to define

* custom_prefix_token_id wasn't set if not passed

* Add ACLUE task (EleutherAI#1614)

* Add task ACLUE

* fix minor bug

* fix code style

* fix code style

* OpenAI Completions -- fix passing of unexpected 'until' arg (EleutherAI#1612)

* add logging of model args (EleutherAI#1619)

* add logging of model args

* nit

* Add warnings.

* nit

* add warning

* nit

* Add vLLM FAQs to README (EleutherAI#1625) (EleutherAI#1633)

* peft Version Assertion (EleutherAI#1635)

* peft Version Assertion

* fix the linter issue

* Seq2seq fix (EleutherAI#1604)

* fix on --task list

* add fixes to tokenization

* differentiate encoding for seq2seq and decoder

* return token setting

* format for pre-commit

* Seq2seq fix, pt2 (EleutherAI#1630)

* getting model class only when defined

* encode_pair handles None, add_special_tokens turned into dict with default value

---------

Co-authored-by: achervyakov <77295913+artemorloff@users.noreply.github.com>

* Integration of NeMo models into LM Evaluation Harness library (EleutherAI#1598)

* Integration of NeMo models into LM Evaluation Harness library

* rename nemo model as nemo_lm

* move nemo section in readme after hf section

* use self.eot_token_id in get_until()

* improve progress bar showing loglikelihood requests

* data replication or tensor/pipeline replication working fine within one node

* run pre-commit on modified files

* check whether dependencies are installed

* clarify usage of torchrun in README

* Fix conditional import for Nemo LM class (EleutherAI#1641)

* Fix SuperGlue's ReCoRD task following regression in v0.4 refactoring (EleutherAI#1647)

* Add Latxa paper evaluation tasks for Basque (EleutherAI#1654)

* add basqueglue

* add eus_exams

* add eus_proficiency

* add eus_reading

* add eus_trivia

* run pre-commit

* Fix CLI --batch_size arg for openai-completions/local-completions (EleutherAI#1656)

The OpenAI interface supports batch size as an argument to the completions API, but does not support specifying it on the CLI, i.e. `lm_eval --model openai-completions --batch_size 16 ...`, because of a simple missing str->int conversion.

This is confirmed by my usage and the stack trace from running `OPENAI_API_KEY=dummy lm_eval --model local-completions --tasks gsm8k --batch_size 16 --model_args model=nm-testing/zephyr-beta-7b-gptq-g128,tokenizer_backend=huggingface,base_url=http://localhost:8000/v1`:
```
Traceback (most recent call last):
  File "/home/michael/venv/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
  File "/home/michael/code/lm-evaluation-harness/lm_eval/__main__.py", line 341, in cli_evaluate
    results = evaluator.simple_evaluate(
  File "/home/michael/code/lm-evaluation-harness/lm_eval/utils.py", line 288, in _wrapper
    return fn(*args, **kwargs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/evaluator.py", line 251, in simple_evaluate
    results = evaluate(
  File "/home/michael/code/lm-evaluation-harness/lm_eval/utils.py", line 288, in _wrapper
    return fn(*args, **kwargs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/evaluator.py", line 390, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/models/openai_completions.py", line 263, in generate_until
    list(sameuntil_chunks(re_ord.get_reordered(), self.batch_size)),
  File "/home/michael/code/lm-evaluation-harness/lm_eval/models/openai_completions.py", line 251, in sameuntil_chunks
    if len(ret) >= size or x[1] != lastuntil:
TypeError: '>=' not supported between instances of 'int' and 'str'
```
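
For illustration only, the missing str->int coercion amounts to something like the following sketch (the helper name is hypothetical, not the actual patch):

```python
# Hypothetical sketch of the missing coercion: the CLI hands batch_size
# over as a string, so it must be converted before integer comparisons.
def coerce_batch_size(value):
    """Return an int batch size, passing 'auto' and None through unchanged."""
    if value is None or value == "auto":
        return value
    return int(value)  # "16" -> 16, so `len(ret) >= size` compares ints


assert coerce_batch_size("16") == 16
assert coerce_batch_size("auto") == "auto"
```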

* Patch QQP prompt (EleutherAI#1661)

* TMMLU+ implementation (EleutherAI#1394)

* implementation of TMMLU+

* implemented: TMMLU+

**TMMLU+: Large-scale Traditional Chinese Massive Multitask Language Understanding**

- 4 categories
    - STEM
    - Social Science
    - Humanities
    - Other

The TMMLU+ dataset, encompassing over 67 subjects and 20,160 tasks, is six times larger and more balanced than its predecessor, TMMLU, and includes benchmark results from closed-source models as well as 20 open-weight Chinese large language models ranging from 1.8B to 72B parameters. However, Traditional Chinese variants continue to underperform compared to major Simplified Chinese models.

```markdown
Total number of tasks in the 'test' sets: 20160
Total number of tasks in the 'validation' sets: 2247
Total number of tasks in the 'train' sets: 335
```

* Remove print from __init__.py

I forgot to remove a debug print from the code.

* update: move TMMLU+ config generation program into default

* fix: use the training set as the few-shot examples

* update: README for TMMLU+

* update: a small changes of TMMLU+ README file

* pre-commit run-through

* Add README for TMMLU+ dataset

* run precommit

* trigger precommit again

* trigger precommit again

* isort is fussy

* isort is fussy

* format, again

* oops

* oops

---------

Co-authored-by: lintang <lintang@eleuther.ai>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Anthropic Chat API (EleutherAI#1594)

* claude3

* supply for anthropic claude3

* supply for anthropic claude3

* anthropic config changes

* add callback options on anthropic

* line passed

* claude3 tiny change

* help anthropic installation

* mention sysprompt / being careful with format in readme

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* fix bug EleutherAI#1664 (EleutherAI#1670)

* fix bug EleutherAI#1664

* handle invalid characters in filenames for Windows and Unix-like systems

see:
https://gist.github.com/doctaphred/d01d05291546186941e1b7ddc02034d3?permalink_comment_id=3958715

* Update lm_eval/__main__.py

* Update scripts/zeno_visualize.py

* fix format

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Update README.md (EleutherAI#1680)

* Add delta weights model loading (EleutherAI#1712)

* added delta weights

* removed debug

* readme update

* better error handling

* autogptq warn

* warn update

* peft and delta error, explicitly deleting _model_delta

* linter fix

* Add `neuralmagic` models for `sparseml` and `deepsparse` (EleutherAI#1674)

* Add neuralmagic models for SparseML and DeepSparse

* Update to latest and add test

* Format

* Fix list to List

* Format

* Add deepsparse/sparseml to automated testing

* Update pyproject.toml

* Update pyproject.toml

* Update README

* Fixes for dtype and device

* Format

* Fix test

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Address review comments!

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix error when appending eot_token_id for generate_until tasks (EleutherAI#1699)

* Adding retries and rate limit to toxicity tasks  (EleutherAI#1620)

* reference `--tasks list` in README (EleutherAI#1726)

EleutherAI#1698

* Add XNLIeu: a dataset for cross-lingual NLI in Basque (EleutherAI#1694)

* add xnli_eu tasks

* update tasks readme

* update readme

* Fix Parameter Propagation for Tasks that have `include`  (EleutherAI#1749)

* Update task.py

* Update __init__.py

* Support individual scrolls datasets (EleutherAI#1740)

* Support individual scrolls datasets

* Add qmsum context

* Fix formatting

* Add filter registry decorator (EleutherAI#1750)

* Add register_filter decorator

* Add register_filter docs
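
As a hedged usage sketch (module paths and the `Filter.apply` signature are assumptions based on the harness's filter API, not verbatim from this PR):

```python
# Hypothetical example of registering a custom filter with the new decorator.
from lm_eval.api.filter import Filter
from lm_eval.api.registry import register_filter


@register_filter("lowercase")
class LowercaseFilter(Filter):
    def apply(self, resps, docs):
        # resps is a per-document list of lists of model responses
        return [[r.lower() for r in resp] for resp in resps]
```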

* remove duplicated `num_fewshot: 0` (EleutherAI#1769)

* Pile 10k new task (EleutherAI#1758)

* Add Pile-10k readme

* Add Pile-10k task configuration file

* Fix m_arc choices (EleutherAI#1760)

* Update utils.py

This is a 4-choice task; option_e is null for all but 3 samples

* Fix options

Adaptive choices

* add option e

* bump multilingual arc version

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* upload new tasks (EleutherAI#1728)

* upload new tasks

* add readmes

* run linters

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* vllm lora support (EleutherAI#1756)

* vllm lora support

* remove print

* version check, rename lora kwarg

* Add option to set OpenVINO config (EleutherAI#1730)

* Add option to set OpenVINO config

* Use utils.eval_logger for logging

* evaluation tracker implementation (EleutherAI#1766)

* evaluation tracker implementation

* OVModelForCausalLM test fix

* typo fix

* moved methods args

* multiple args in one flag

* loggers moved to dedicated dir

* improved filename sanitization

* eval tracker args fix (EleutherAI#1777)

* limit fix (EleutherAI#1785)

* remove echo parameter in OpenAI completions API (EleutherAI#1779)

* remove echo parameter in OpenAI completions API

* remove context length parameter doc string

* Fix README: change`----hf_hub_log_args` to `--hf_hub_log_args` (EleutherAI#1776)

fix `----hf_hub_log_args` to `--hf_hub_log_args`

* Fix bug in setting until kwarg in openai completions (EleutherAI#1784)

* Provide ability for custom sampler for ConfigurableTask (EleutherAI#1616)

* Added fewshot sampling seeds to evaluator.simple_evaluate signature

A way to control the seed of fewshot sampling; may help with EleutherAI#1591

* Added ability for custom sampler for ConfigurableTask

It may be set in the config like:
```
fewshot_config:
  sampler: !function utils.MyFewshotSampler
```
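
A rough sketch of what such a sampler might look like (in the harness the custom sampler subclasses `lm_eval.api.samplers.ContextSampler`; the standalone class and constructor below are simplifying assumptions):

```python
# Minimal illustrative sampler: draws n distinct few-shot docs with its own RNG.
import random


class MyFewshotSampler:
    def __init__(self, docs, rnd=None):
        self.docs = list(docs)
        self.rnd = rnd or random.Random(1234)

    def sample(self, n):
        # return n distinct documents to use as few-shot examples
        return self.rnd.sample(self.docs, n)
```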

* explicitly set fewshot random generator seed for HFLM generate_until_task test

* add backward compatibility for three args seed setup

* save seeds info to logs/reports

* Update `--tasks list` option in interface documentation (EleutherAI#1792)

* Fix Caching Tests ; Remove `pretrained=gpt2` default (EleutherAI#1775)

* link to the example output on the hub (EleutherAI#1798)

* Re-add Hendrycks MATH (no sympy checking, no Minerva hardcoded prompt) variant (EleutherAI#1793)

* add Hendrycks MATH (no sympy checking) variant

* add readmes for MATH tasks

* Logging Updates (Alphabetize table printouts, fix eval tracker bug) (EleutherAI#1774) (EleutherAI#1791)

* fix auto-batch size bug for seq2seq models

* alphabetize task + group tables ; fix eval tracker bug

* fix eval tracker bug

* Initial integration of the Unitxt to LM eval harness (EleutherAI#1615)

* Initial support for Unitxt datasets in LM Eval Harness

See  https://github.com/IBM/unitxt

The script 'generate_yamls.py' creates LM Eval Harness yaml files corresponding to Unitxt datasets specified in the 'unitxt_datasets' file.

The glue code required to register Unitxt metrics is in 'unitxt_wrapper.py'.

* Added dataset loading check to generate_yaml

Improved error messages.

* Speed up generate_yaml

Added printouts and improved error message

* Added output printout

* Simplified integration of unitxt datasets

Store all the common yaml configuration in a yaml include shared by all datasets of the same task.

* Post code review comments - part 1

1. Made sure include files don't end with 'yaml' so they won't be marked as tasks
2. Added more datasets and tasks (NER, GEC)
3. Added README

* Post code review comments - part 2

1. Added a unitxt install option in pyproject.toml:
`pip install 'lm_eval[unitxt]'`
2. Added a check that unitxt is installed, printing a clear error message if not
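
A minimal sketch of the kind of check described (exact wording and placement assumed):

```python
# Illustrative import guard: fail early with a clear message if unitxt is absent.
try:
    import unitxt  # noqa: F401
except ImportError as exc:
    raise ImportError(
        "Unitxt tasks require the unitxt extra: pip install 'lm_eval[unitxt]'"
    ) from exc
```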

* Committed missing pyproject change

* Added documentation on adding datasets

* More doc changes

* add unitxt extra to readme

* run precommit

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* add task for mmlu evaluation in arc multiple choice format (EleutherAI#1745)

* add mmlu arc style evaluation

* rename arc_style to continuation

---------

Co-authored-by: Jonathan Burdge <jburdge@mahti-login11.mahti.csc.fi>
Co-authored-by: Jonathan Burdge <jburdge@mahti-login12.mahti.csc.fi>

* Update flag `--hf_hub_log_args` in interface documentation (EleutherAI#1806)

* update interface documentation with flag --hf_hub_logs_arg

* update interface documentation with flag --hf_hub_logs_arg 2

* Copal task (EleutherAI#1803)

* add copal

* change name to copal id for clarity and the task name

* remove `copal_id...` from the yaml to make it work

* checkmark on README

* change group name to `copal_id`

* Adding tinyBenchmarks datasets (EleutherAI#1545)

* Add tinyBenchmarks

* Add acknowledgements

* Add ordering of outputs for data-parallel

* Run pre-commit

* Add few_shot specifications

* Add tinyBenchmarks post-processing

* add conditional import ; fix task names

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* interface doc update (EleutherAI#1807)

* Fix links in README guiding to another branch (EleutherAI#1838)

* Fix: support PEFT/LoRA with added tokens (EleutherAI#1828)

* resize model embeddings

* resize only

* tokenizer help

* load tokenizer before model

* add comment and run precommit lint

* Add log message

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fixed incorrect check for task type (replace `~` with `not`) (EleutherAI#1865)

* fixed docs typos (EleutherAI#1863)

* Update polemo2_out.yaml (EleutherAI#1871)

* Unpin vllm in dependencies (EleutherAI#1874)

* Fix outdated links to the latest links in `docs` (EleutherAI#1876)

* [HFLM]Use Accelerate's API to reduce hard-coded CUDA code (EleutherAI#1880)

* Fix `batch_size=auto` for HF Seq2Seq models (EleutherAI#1765) (EleutherAI#1790)

* fix auto-batch size bug for seq2seq models

* run linter

* Fix Brier Score (EleutherAI#1847)

`gold_one_hot` needs to follow the class dimension of predictions so that it still works when `--limit` is used and the indices in gold do not cover all possible gold classes.
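
An illustrative sketch of the fix (not the harness's exact code): size the one-hot matrix by the predictions' class dimension rather than by the gold labels that happen to be present.

```python
# Build gold_one_hot with as many columns as the predictions have, so a
# --limit'ed subset that never shows some classes still lines up.
import numpy as np

preds = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])  # (n_samples, n_classes)
gold = np.array([0, 1])              # class 2 never appears in this subset
gold_one_hot = np.eye(preds.shape[1])[gold]
brier = np.mean(np.sum((preds - gold_one_hot) ** 2, axis=1))
```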

* Fix for bootstrap_iters = 0 case (EleutherAI#1715) (EleutherAI#1789)

* add handling for bootstrap_iters=0 case

* add more detail to docstring

* run precommit

* add mmlu tasks from pile-t5 (EleutherAI#1710)

* add mmlu tasks from pile-t5

* Update _mmlu_flan_cot_fewshot_template_yaml

* Update _mmlu_flan_cot_zeroshot_template_yaml

* Update _mmlu_flan_generative_template_yaml

* Update _mmlu_flan_loglikelihood_template_yaml

* Update _default_template_yaml

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Bigbench fix (EleutherAI#1686)

* edit process multiple-choice

* split template yaml

* remove

* modified multiple_choice tasks

* update

* Update multiple_choice_template_b_yaml

* Update multiple_choice_template_a_yaml

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Rename `lm_eval.logging -> lm_eval.loggers` (EleutherAI#1858)

* rename lm_eval.logging module

* fix evaluation tracker args

* Updated vllm imports in vllm_causallms.py (EleutherAI#1890)

* Reorder vllm imports in vllm_causallms.py

* Update vllm_causallms.py

* [HFLM]Add support for Ascend NPU (EleutherAI#1886)

* [HFLM]Add support for Ascend NPU

Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>

* bump accelerate dependency version to 0.26.0 for NPU compat.

---------

Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* `higher_is_better` tickers in output table (EleutherAI#1893)

* Higher is better tickers in output table

* add extra check for `higher_is_better` not being None already

* Update lm_eval/evaluator.py

* fixup format I messed up

* add comment (and retrigger tests)

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Add dataset card when pushing to HF hub (EleutherAI#1898)

* dataset card initial

* few fixes

* adds groups for math, mmlu, gpqa

* added summary args

* moved sanitize_list to utils

* readme update

* recreate metadata moved

* multiple model support

* results latest split fix

* readme update and small refactor

* fix grouping

* add comments

* added pathlib

* corrected pathlib approach

* check whether to create a metadata card

* convert posix paths to str

* default hf org from token

* hf token value error

* Add logs after successful upload

* logging updates

* dataset card example in the readme

---------

Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Alina Lozovskaia <alinailozovskaya@gmail.com>

* Making hardcoded few shots compatible with the chat template mechanism (EleutherAI#1895)

* init test 1

* fix

* this format seems to be working - need to update all other tasks with the new format

* bbh with few shot format

* fix fewshot bbh

* add mmlu flan cot

* samples of cot

* kmmlu

* fix gsm8k

* update keys for mmlu

* minerva math

* bbh

* fix

* fix samples

* small fixes to templates

* last prompt format change

* fixing prompt

* fixed minerva math format

* rm accidental commited file

* added doc for few shot samples

* Update lm_eval/loggers/evaluation_tracker.py

* Update lm_eval/loggers/evaluation_tracker.py

* Update docs/new_task_guide.md

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* added check in sampler per code review

* added the system from a function, plus an example in minerva math

* style

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix unit tests 1

* forcing use of test split

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Try to make existing tests run little bit faster (EleutherAI#1905)

* Fix fewshot seed only set when overriding num_fewshot (EleutherAI#1914)

Fix EleutherAI#1906

* Complete task list from pr 1727 (EleutherAI#1901)

* added tasks and task family descriptors

* continue work on task list w/ links; slightly reorganize README

* Apply suggestions from code review

* Rename file so that it'll preview in Github when viewing lm_eval/tasks folder

* Update new_task_guide.md

* Update README.md

* run linter

* Add language column to task table; Add missing tasks to task table; fix nq_open and storycloze READMEs

* fix typo

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* apply format

---------

Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Add chat template (EleutherAI#1873)

* initial chat template

* tokenizer attribute check

* variable rename

* interface update

* system instruction

* system inst default update

* fewshot as multiturn

* typing update

* indent update

* added comments

* Adding a fewshot in a more readable way

* linting

* Moved apply chat template to LM

* multiturn alternation fix

* cache key update

* apply chat template method fix

* add system prompt hash to cache_key

* tokenizer name property for cache_key

* property name fix

* linting backward compatibility fix

* docs and errors update

* add documentation on adding chat template compatibility to model_guide

* fewshot as multiturn check fix

* saving system inst and chat template in results

* eval tracker update

* docs update

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data (EleutherAI#1867)

* glianorex tasks

* Create README.md

* Update README.md

* Update README.md

* fix formatting

* fix internal formatting

* Modify pre-commit hook to check merge conflicts accidentally committed not at current merge commit (EleutherAI#1927)

* [add] fld logical formula task (EleutherAI#1931)

* Add new Lambada translations (EleutherAI#1897)

* added tasks and task family descriptors

* configs for the new lambada translations

* continue work on task list w/ links; slightly reorganize README

* Apply suggestions from code review

* Rename file so that it'll preview in Github when viewing lm_eval/tasks folder

* Update new_task_guide.md

* Update README.md

* run linter

* Add language column to task table; Add missing tasks to task table; fix nq_open and storycloze READMEs

* fix typo

* update `lm_eval/tasks/README.md` with task description

---------

Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: anthony <anthonydipofi@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Implement NoticIA (EleutherAI#1912)

* Noticia

* test

* Final tests implementation

* Fixes

* Fix linters

* Add the Arabic version of the PICA benchmark (EleutherAI#1917)

* Update siqa.yaml (EleutherAI#1909)

* Update basque-glue (EleutherAI#1913)

* Update README.md

* Update bec.yaml

* Update bhtc.yaml

* Update coref.yaml

* Update qnli.yaml

* Update vaxx.yaml

* Update wic.yaml

* Test output table layout consistency (EleutherAI#1916)

* sort metrics in output table

* update docstring in `consolidate_results`

* add tests for verifying consistency of table output

* update tests to account for floating point inconsistencies

* updated tests based on `pythia-14m`

* Update __main__.py (EleutherAI#1939)

* Add the Arabic version, refactoring Arabic pica into the alghafa folder (EleutherAI#1940)

* Results filenames handling fix (EleutherAI#1926)

* results filenames handling moved to utils

* zeno results handling fix

* tasks_for_model backward compatibility

* results files logic moved to tasks_for_model

* moved sanitize_model_name to utils

* Remove AMMLU Due to Translation (EleutherAI#1948)

* Update README.md

* Delete lm_eval/tasks/ammlu directory

* add include_defaults kwarg to taskmanager, add tests for include_path (EleutherAI#1856)

* add hacky add_bos_token forcing for Gemma to VLLM too (EleutherAI#1857)

* Update interface.md (EleutherAI#1955)

* Fix self.max_tokens in anthropic_llms.py (EleutherAI#1848)

Fix bug where `self.max_tokens` was not set

* `samples` is newline delimited (EleutherAI#1930)

* `samples` is newline delimited

* updated git and pre-commit

* appease pre-commit

* nit

* Revert back for now

* Revert for now

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Fix `--gen_kwargs` and VLLM (`temperature` not respected) (EleutherAI#1800)

* Update vllm_causallms.py

* adjust

---------

Co-authored-by: lintangsutawika <lintang@eleuther.ai>

* make write_out.py explicitly error if no splits match (EleutherAI#1796)

Co-authored-by: lintangsutawika <lintang@eleuther.ai>

* fix: add directory filter to os.walk to ignore 'ipynb_checkpoints' (EleutherAI#1956)

* fix: add filter to os.walk to ignore 'ipynb_checkpoints'

* Update __init__.py

* Update __init__.py

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* add trust_remote_code  for piqa (EleutherAI#1983)

Signed-off-by: changwangss <chang1.wang@intel.com>

* Fix self assignment in neuron_optimum.py (EleutherAI#1990)

* [New Task] Add Paloma benchmark (EleutherAI#1928)

* init paloma benchmark

* pre-process in utils function

* add `task_alias`

* updated task aliases

* Update paloma_dolma-v1_5.yaml

* Update paloma_twitterAAE_HELM_fixed.yaml

* Update paloma_dolma_100_programing_languages.yaml

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Fix Paloma Template yaml (EleutherAI#1993)

* init paloma benchmark

* pre-process in utils function

* add `task_alias`

* updated task aliases

* Update paloma_dolma-v1_5.yaml

* Update paloma_twitterAAE_HELM_fixed.yaml

* Update paloma_dolma_100_programing_languages.yaml

* update on names

* fix paloma template issue

---------

Co-authored-by: Zafir Stojanovski <zaf.stojano@gmail.com>
Co-authored-by: Zafir Stojanovski <zafir.stojanovski@icloud.com>
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Log `fewshot_as_multiturn` in results files (EleutherAI#1995)

* log fewshot_as_multiturn in general tracker args

* Update evaluator.py

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Added ArabicMMLU (EleutherAI#1987)

* Added ArabicMMLU

* Rename `ammlu` to `arabicmmlu`

* Fix Datasets `--trust_remote_code` (EleutherAI#1998)

* Add BertaQA dataset tasks (EleutherAI#1964)

* add bertaqa tasks

* rename basquetrivia-->bertaqa ; make template stub not .yaml

* add bertaqa entry to lm_eval/tasks/README.md

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Fix precommit hook, update run_models.sh

---------

Signed-off-by: changwangss <chang1.wang@intel.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: khalil <90086758+khalil-Hennara@users.noreply.github.com>
Co-authored-by: Lintang Sutawika <lintang@sutawika.com>
Co-authored-by: Alex Bäuerle <alex@a13x.io>
Co-authored-by: Wongboo <44860323+Wongboo@users.noreply.github.com>
Co-authored-by: achervyakov <77295913+artemorloff@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: Eitan Turok <150733043+eitanturok@users.noreply.github.com>
Co-authored-by: Rylan Schaeffer <rylanschaeffer@gmail.com>
Co-authored-by: Vicki Boykis <vicki@mozilla.ai>
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
Co-authored-by: kwrobel.eth <djstrong@gmail.com>
Co-authored-by: Nouf M. Alotaibi <63472979+noufmitla@users.noreply.github.com>
Co-authored-by: Haonan Li <nathan.8270.n@gmail.com>
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: WoosungMyung <115716986+LameloBally@users.noreply.github.com>
Co-authored-by: Sergio Perez <sergioperezperez24@gmail.com>
Co-authored-by: Or Sharir <or@sharir.org>
Co-authored-by: Julen Etxaniz <juletxara@gmail.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: ZoneTwelve <zonetwelve159@gmail.com>
Co-authored-by: Seungwoo Ryu <seungwoo.ryu.94@gmail.com>
Co-authored-by: nicho2 <nicho2@laposte.net>
Co-authored-by: KonradSzafer <61851539+KonradSzafer@users.noreply.github.com>
Co-authored-by: Sergio Perez <sergioperezpersonal@gmail.com>
Co-authored-by: sator-labs <129434630+sator-labs@users.noreply.github.com>
Co-authored-by: Brian Vaughan <nairbv@users.noreply.github.com>
Co-authored-by: giorgossideris <56915448+giorgossideris@users.noreply.github.com>
Co-authored-by: Nikita Lozhnikov <nikitml@gmail.com>
Co-authored-by: Chujie Zheng <chujiezhengchn@gmail.com>
Co-authored-by: Gabriel Mukobi <gabrielmukobi@gmail.com>
Co-authored-by: Zehan Li <69186130+jordane95@users.noreply.github.com>
Co-authored-by: Simran Arora <emailsimran@gmail.com>
Co-authored-by: bcicc <142823000+bcicc@users.noreply.github.com>
Co-authored-by: Helena Kloosterman <helena.kloosterman@intel.com>
Co-authored-by: Muhammad Bin Usman <muhammadbin.2003@gmail.com>
Co-authored-by: ciaranby <48831615+ciaranby@users.noreply.github.com>
Co-authored-by: LSinev <LSinev@users.noreply.github.com>
Co-authored-by: aditya thomas <aditya.thomas@alum.mit.edu>
Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
Co-authored-by: jonabur <135807120+jonabur@users.noreply.github.com>
Co-authored-by: Jonathan Burdge <jburdge@mahti-login11.mahti.csc.fi>
Co-authored-by: Jonathan Burdge <jburdge@mahti-login12.mahti.csc.fi>
Co-authored-by: Edd <68678137+Erland366@users.noreply.github.com>
Co-authored-by: Lucas Weber <35227161+LucWeber@users.noreply.github.com>
Co-authored-by: Nick Doiron <ndoiron@mapmeld.com>
Co-authored-by: Zafir Stojanovski <zafir.stojanovski@icloud.com>
Co-authored-by: zhabuye <74179177+zhabuye@users.noreply.github.com>
Co-authored-by: Edward Gan <efuzzy@gmail.com>
Co-authored-by: DongGeon Lee <dg.lee@postech.ac.kr>
Co-authored-by: Huazhong Ji <hzji210@gmail.com>
Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>
Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Alina Lozovskaia <alinailozovskaya@gmail.com>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: anthony-dipofi <anthonydipofi@gmail.com>
Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: Maxime <672982+maximegmd@users.noreply.github.com>
Co-authored-by: MorishT <106973776+MorishT@users.noreply.github.com>
Co-authored-by: Iker García-Ferrero <i.garciaferrerosanpelayo@gmail.com>
Co-authored-by: Zafir Stojanovski <zaf.stojano@gmail.com>
Co-authored-by: Sadra Barikbin <sadraqazvin1@yahoo.com>
Co-authored-by: johnwee1 <91670254+johnwee1@users.noreply.github.com>
Co-authored-by: Wang, Chang <491521017@qq.com>
Co-authored-by: Yazeed Alnumay <61038456+Yazeed7@users.noreply.github.com>
Showing 996 changed files with 22,541 additions and 9,706 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/new_tasks.yml
@@ -20,13 +20,13 @@ jobs:
with:
fetch-depth: 2 # OR "2" -> To retrieve the preceding commit.

-# Uses the tj-actions/changed-files@v37 action to check for changes.
+# Uses the tj-actions/changed-files action to check for changes.
# Outputs provided here: https://github.com/tj-actions/changed-files#outputs
# The `files_yaml` input optionally takes a yaml string to specify filters,
# and prepends the filter name to the standard output names.
- name: Check task folders
id: changed-tasks
-uses: tj-actions/changed-files@v37.1.2
+uses: tj-actions/changed-files@v44.5.2
with:
# tasks checks the tasks folder and api checks the api folder for changes
files_yaml: |
4 changes: 2 additions & 2 deletions .github/workflows/unit_tests.yml
@@ -32,7 +32,7 @@ jobs:
env:
SKIP: "no-commit-to-branch,mypy"

-uses: pre-commit/action@v3.0.0
+uses: pre-commit/action@v3.0.1
# # mypy turned off for now
# - name: Lint with mypy
# run: mypy . --ignore-missing-imports --check-untyped-defs --explicit-package-bases --warn-unreachable
@@ -56,7 +56,7 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
-pip install -e '.[dev,anthropic,sentencepiece,optimum]' --extra-index-url https://download.pytorch.org/whl/cpu
+pip install -e '.[dev,anthropic,sentencepiece,optimum,deepsparse,sparseml]' --extra-index-url https://download.pytorch.org/whl/cpu
# Install optional git dependencies
# pip install bleurt@https://github.com/google-research/bleurt/archive/b610120347ef22b494b6d69b4316e303f5932516.zip#egg=bleurt
# if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
18 changes: 9 additions & 9 deletions .pre-commit-config.yaml
@@ -10,6 +10,7 @@ repos:
- id: check-case-conflict
- id: check-json
- id: check-merge-conflict
+args: [--assume-in-merge]
- id: check-symlinks
- id: check-yaml
args: ["--unsafe"]
@@ -28,8 +29,7 @@ repos:
- id: mixed-line-ending
args: [--fix=lf]
- repo: https://github.com/astral-sh/ruff-pre-commit
-# Ruff version.
-rev: v0.2.2
+rev: v0.4.8
hooks:
# Run the linter.
- id: ruff
@@ -38,17 +38,17 @@
# Run the formatter.
- id: ruff-format
- repo: https://github.com/codespell-project/codespell
-rev: v2.2.6
+rev: v2.3.0
hooks:
- id: codespell
exclude: >
(?x)^(
.*\.json|ignore.txt|lm_eval/tasks/.*|.*yaml|.*\.ipynb
)$
args: [--check-filenames, --check-hidden, --ignore-words=ignore.txt]
-- repo: https://github.com/pre-commit/mirrors-mypy
-rev: v1.5.1
-hooks:
-- id: mypy
-additional_dependencies: [".[sentencepiece,multilingual,promptsource,gptq]", "types-PyYAML", "types-requests"]
-exclude: ^tests/.*$
+# - repo: https://github.com/pre-commit/mirrors-mypy
+# rev: v1.5.1
+# hooks:
+# - id: mypy
+# additional_dependencies: [".[sentencepiece,multilingual,promptsource,gptq]", "types-PyYAML", "types-requests"]
+# exclude: ^tests/.*$
104 changes: 97 additions & 7 deletions README.md

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions docs/README.md
@@ -4,7 +4,7 @@ Welcome to the docs for the LM Evaluation Harness!

## Table of Contents

-* To learn about the public interface of the library, as well as how to evaluate via the commandline or as integrated into an external library, see the [Interface](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/interface.md)
-* To learn how to add a new library, API, or model type to the library, as well as a quick explainer on the types of ways to evaluate an LM, see the [Model Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/model_guide.md).
-* For a crash course on adding new tasks to the library, see our [New Task Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/new_task_guide.md).
-* To learn more about pushing the limits of task configuration that the Eval Harness supports, see the [Task Configuration Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/task_guide.md).
+* To learn about the public interface of the library, as well as how to evaluate via the commandline or as integrated into an external library, see the [Interface](./interface.md)
+* To learn how to add a new library, API, or model type to the library, as well as a quick explainer on the types of ways to evaluate an LM, see the [Model Guide](./model_guide.md).
+* For a crash course on adding new tasks to the library, see our [New Task Guide](./new_task_guide.md).
+* To learn more about pushing the limits of task configuration that the Eval Harness supports, see the [Task Configuration Guide](./task_guide.md).
35 changes: 25 additions & 10 deletions docs/interface.md
@@ -10,11 +10,11 @@ Equivalently, running the library can be done via the `lm-eval` entrypoint at the

This mode supports a number of command-line arguments, the details of which can also be seen by running with `-h` or `--help`:

-- `--model` : Selects which model type or provider is evaluated. Must be a string corresponding to the name of the model type/provider being used. See [the main README](https://github.com/EleutherAI/lm-evaluation-harness/tree/main#commercial-apis) for a full list of enabled model names and supported libraries or APIs.
+- `--model` : Selects which model type or provider is evaluated. Must be a string corresponding to the name of the model type/provider being used. See [the main README](https://github.com/EleutherAI/lm-evaluation-harness/tree/main#model-apis-and-inference-servers) for a full list of enabled model names and supported libraries or APIs.

- `--model_args` : Controls parameters passed to the model constructor. Accepts a string containing comma-separated keyword arguments to the model class of the format `"arg1=val1,arg2=val2,..."`, such as, for example `--model_args pretrained=EleutherAI/pythia-160m,dtype=float32`. For a full list of what keyword arguments, see the initialization of the `lm_eval.api.model.LM` subclass, e.g. [`HFLM`](https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/models/huggingface.py#L66)

-- `--tasks` : Determines which tasks or task groups are evaluated. Accepts a comma-separated list of task names or task group names. Must be solely comprised of valid tasks/groups.
+- `--tasks` : Determines which tasks or task groups are evaluated. Accepts a comma-separated list of task names or task group names. Must be solely comprised of valid tasks/groups. A list of supported tasks can be viewed with `--tasks list`.

- `--num_fewshot` : Sets the number of few-shot examples to place in context. Must be an integer.

@@ -42,13 +42,28 @@ This mode supports a number of command-line arguments, the details of which can

- `--show_config` : If used, prints the full `lm_eval.api.task.TaskConfig` contents (non-default settings in the task YAML file) for each task which was run, at the completion of an evaluation. Useful for when one is modifying a task's configuration YAML locally to transmit the exact configurations used for debugging or for reproducibility purposes.

-- `--include_path` : Accepts a path to a folder. If passed, then all YAML files containing ` lm-eval`` compatible task configurations will be added to the task registry as available tasks. Used for when one is writing config files for their own task in a folder other than `lm_eval/tasks/`
+- `--include_path` : Accepts a path to a folder. If passed, then all YAML files containing `lm-eval` compatible task configurations will be added to the task registry as available tasks. Used for when one is writing config files for their own task in a folder other than `lm_eval/tasks/`.

- `--system_instruction`: Specifies a system instruction string to prepend to the prompt.

- `--apply_chat_template` : If this flag is on, a chat template will be applied to the prompt. For Hugging Face models, the chat template is taken from the tokenizer; if the tokenizer does not have a chat template, a default one will be applied. For other models, chat templating is not currently implemented.

- `--fewshot_as_multiturn` : If this flag is on, the Fewshot examples are treated as a multi-turn conversation. Questions are provided as user content and answers are provided as assistant responses. Requires `--num_fewshot` to be set to be greater than 0, and `--apply_chat_template` to be on.

- `--predict_only`: Generates the model outputs without computing metrics. Use with `--log_samples` to retrieve decoded results.

* `--seed`: Set seed for python's random, numpy and torch. Accepts a comma-separated list of 3 values for python's random, numpy, and torch seeds, respectively, or a single integer to set the same seed for all three. The values are either an integer or 'None' to not set the seed. Default is `0,1234,1234` (for backward compatibility). E.g. `--seed 0,None,8` sets `random.seed(0)` and `torch.manual_seed(8)`. Here numpy's seed is not set since the second value is `None`. E.g, `--seed 42` sets all three seeds to 42. (A sketch of this mapping follows after this list.)

-* `--wandb_args`: Tracks logging to Weights and Biases for evaluation runs and includes args passed to `wandb.init`, such as `project` and `job_type`. Full list (here.)[https://docs.wandb.ai/ref/python/init]. e.g., ```--wandb_args project=test-project,name=test-run```
+* `--wandb_args`: Tracks logging to Weights and Biases for evaluation runs and includes args passed to `wandb.init`, such as `project` and `job_type`. Full list [here](https://docs.wandb.ai/ref/python/init). e.g., ```--wandb_args project=test-project,name=test-run```

* `--hf_hub_log_args` : Logs evaluation results to Hugging Face Hub. Accepts a string with the arguments separated by commas. Available arguments:
* `hub_results_org` - organization name on Hugging Face Hub, e.g., `EleutherAI`. If not provided, the results will be pushed to the owner of the Hugging Face token,
* `hub_repo_name` - repository name on Hugging Face Hub, e.g., `lm-eval-results`,
* `push_results_to_hub` - whether to push results to Hugging Face Hub, can be `True` or `False`,
* `push_samples_to_hub` - whether to push samples results to Hugging Face Hub, can be `True` or `False`. Requires `--log_samples` to be set,
* `public_repo` - whether the repository is public, can be `True` or `False`,
* `leaderboard_url` - URL to the leaderboard, e.g., `https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard`.
* `point_of_contact` - Point of contact for the results dataset, e.g., `yourname@example.com`.
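
As a hedged illustration of the `--seed` flag described above, `--seed 0,None,8` corresponds roughly to:

```python
# Rough Python equivalent of `--seed 0,None,8` (illustrative only).
import random

import torch

random.seed(0)        # first value: python's random
# second value is None, so numpy's seed is left unset
torch.manual_seed(8)  # third value: torch
```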

## External Library Usage

@@ -77,7 +92,7 @@ task_manager = lm_eval.tasks.TaskManager()

# Setting `task_manager` to the one above is optional and should generally be done
# if you want to include tasks from paths other than ones in `lm_eval/tasks`.
-# `simple_evaluate` will instantiate its own task_manager is the it is set to None here.
+# `simple_evaluate` will instantiate its own task_manager if it is set to None here.
results = lm_eval.simple_evaluate( # call simple_evaluate
model=lm_obj,
tasks=["taskname1", "taskname2"],
@@ -112,8 +127,8 @@ my_model = initialize_my_model()
# - `Your_LM.generate_until()`
lm_obj = Your_LM(model=my_model, batch_size=16)

-# The task_manager indexes tasks including ones
-# specified by the user through `include_path`
+# optional: the task_manager indexes tasks including ones
+# specified by the user through `include_path`.
task_manager = lm_eval.tasks.TaskManager(
include_path="/path/to/custom/yaml"
)
@@ -132,15 +147,15 @@ task_dict = lm_eval.tasks.get_task_dict(
task_manager # A task manager that allows lm_eval to
# load the task during evaluation.
# If none is provided, `get_task_dict`
-# will instantiated one itself, but this
+# will instantiate one itself, but this
# only includes the stock tasks so users
# will need to set this if including
# custom paths is required.
)

-def evaluate(
+results = evaluate(
    lm=lm_obj,
    task_dict=task_dict,
-    ...
-):
+)
```
49 changes: 48 additions & 1 deletion docs/model_guide.md
@@ -6,7 +6,7 @@ In order to properly evaluate a given LM, we require implementation of a wrapper

## Setup

-To get started contributing, go ahead and fork the main repo, clone it, create a branch with the name of your task, and install the project requirements in your environment:
+To get started contributing, go ahead and fork the main repo, clone it, create a branch with the name of your model, and install the project requirements in your environment:

```sh
# After forking...
@@ -107,6 +107,53 @@ Using this decorator results in the class being added to an accounting of the us

We also recommend that new model contributions be accompanied by short tests of their 3 core functionalities, at minimum. To see an example of such tests, look at https://github.com/EleutherAI/lm-evaluation-harness/blob/35bdecd379c0cefad6897e67db892f4a6026a128/tests/test_ggml.py .

## Chat Templating

Many models are fine-tuned with a [Chat Template](https://huggingface.co/docs/transformers/main/en/chat_templating) in order to enable back-and-forth interaction between a "User"'s queries and the model (often called "Assistant")'s responses. It can be desirable to evaluate fine-tuned models on evaluation tasks while wrapped in the conversational format they expect.

In order to make your model optionally compatible with a chat format, three additional methods must be implemented:

```python
class MyCustomLM(LM):
#...
@property
def tokenizer_name(self) -> str:
# should return a string denoting the name of the model's tokenizer and/or the accompanying chat template.

@property
def chat_template(self) -> str:
# should return a chat template formatting string that is used to build prompt from a user/assistant chat history.
# this will be saved in the evaluation results for reproducibility.

def apply_chat_template(self, chat_history: List[Dict[str, str]]) -> str:
# responsible for taking as input a chat history that would be fed into the model, and
# rendering it as a string that can be then tokenized and input into the model.
#...
```

- `apply_chat_template`
- This method performs the bulk of the work required for chat-formatting.
- As input, a `chat_history: List[Dict[str, str]]` is passed in. This is a transcript of a conversation of a form similar to
```
[
{"system": <user-provided system message such as "You are a helpful math-focused chatbot">},
{"user": <task example - a few-shot example 'input'>}
{"assistant": <correct response to the above example>},
# ... more few-shot examples, potentially
{"user": <test set query--response on which we will evaluate>},
]
```
which can then be converted into a string input.
- The output is a string representing this conversation that can be fed into the model.
- For example, this consists of simply calling `tokenizer.apply_chat_template` for HFLM--see the implementation there for reference.
- `tokenizer_name`
- LM Eval Harness supports [caching requests](https://github.com/EleutherAI/lm-evaluation-harness/blob/4902aaaf1f374682f95ac25fe2e13b23faddc91a/lm_eval/__main__.py#L140) that are sent to a model, for faster setup when repeating an already-performed evaluation.
- However, we don't want to use the cache of chat transcripts rendered using one chat template or system prompt to send to a model with a different template! So, we use this `lm.tokenizer_name` string to distinguish caches for a given model (and chat template) from one another.
- `chat_template`
- Chat templates are typically provided as a Jinja template string or a string formatted with str.format to include user and assistant messages in a single prompt. This template string is saved in the evaluation results to ensure reproducibility.

If not implemented for a given model type, the flags `--apply_chat_template` , `--fewshot_as_multiturn`, and `--system_instruction` cannot be used.
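
For reference, a minimal sketch of these three methods for a model backed by a Hugging Face tokenizer might look like the following (the real `HFLM` implementation differs in detail, and `self.tokenizer` is assumed to exist):

```python
from typing import Dict, List


class MyCustomLM(LM):
    @property
    def tokenizer_name(self) -> str:
        # used to keep request caches for different chat templates apart
        return self.tokenizer.name_or_path.replace("/", "__")

    @property
    def chat_template(self) -> str:
        # stored in the results file for reproducibility
        return self.tokenizer.chat_template or ""

    def apply_chat_template(self, chat_history: List[Dict[str, str]]) -> str:
        # render the transcript into a single prompt string for the model
        return self.tokenizer.apply_chat_template(
            chat_history, tokenize=False, add_generation_prompt=True
        )
```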

## Other

**Pro tip**: In order to make the Evaluation Harness overestimate total runtimes rather than underestimate them, HuggingFace models come in-built with the ability to provide responses on data points in *descending order by total input length* via `lm_eval.utils.Reorderer`. Take a look at `lm_eval.models.hf_causal.HFLM` to see how this is done, and see if you can implement it in your own model!