Release v0.3.0 · Aleph-Alpha-Research/eval-framework

What's Changed

fix: OLMES matching effort (MC Task Suite) by @fsschneider in #182
feat: add metric aggregators such as Pass@K by @prabhuteja12 in #190
feat: add GSM8K with OLMES parity by @prabhuteja12 in #191
feat: adding pool to sandbox by @prabhuteja12 in #194
fix: scipy should be a non-optional dependency by @fsschneider in #196
feat: add BPB metric to more tasks by @fsschneider in #198
fix: bug fix in grouping per subject metrics by @prabhuteja12 in #201
feat: add OLMES variant of BigCodeBench by @tfburns in #184
feat: Task suites by @prabhuteja12 in #200
refactor: add task formatters and helper functions for choice-based e… by @martinreinhardt01 in #202
fix: match OLMES for GenQA tasks by @fsschneider in #195
chore: Removing assertions that prevent proper logging of error messages by @prabhuteja12 in #203
feat: add MultiPL HumanEval & MBPP tasks by @tfburns in #189

Full Changelog: v0.2.14...v0.2.15