EpiK-Eval: Evaluation for Language Models as Epistemic Models

A benchmark that evaluates how well language models consolidate and recall information spread across multiple training documents when answering questions at inference time.

Requirements

python >= 3.10.0
accelerate >= 0.24.1
pandas >= 2.1.1
protobuf >= 4.24.4
pytorch >= 2.1.0
sentencepiece >= 0.1.99
transformers >= 4.35.1
matplotlib (Optional)
wandb >= 0.15.12 (Optional)
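
Before installing, it can help to check which of the required packages are already present and recent enough. The following is a minimal sketch, not part of the repository: the helper names (version_tuple, check_requirements) are hypothetical, the distribution names are assumptions (PyTorch is assumed to be installed as "torch"), and the version comparison is deliberately rough, looking only at the leading numeric components.

```python
from importlib import metadata

# Minimum versions copied from the requirements list above.
# Distribution names are assumptions; PyTorch is assumed to be "torch".
MINIMUMS = {
    "accelerate": "0.24.1",
    "pandas": "2.1.1",
    "protobuf": "4.24.4",
    "torch": "2.1.0",
    "sentencepiece": "0.1.99",
    "transformers": "4.35.1",
}


def version_tuple(version):
    """Turn '2.1.0+cu121' into (2, 1, 0), ignoring non-numeric suffixes."""
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad so that '4.35' compares equal to '4.35.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_requirements(minimums=MINIMUMS):
    """Return {name: problem} for packages that are missing or too old."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if version_tuple(installed) < version_tuple(minimum):
            problems[name] = f"found {installed}, need >= {minimum}"
    return problems


if __name__ == "__main__":
    for name, problem in check_requirements().items():
        print(f"{name}: {problem}")
```

An empty result means every listed package meets its minimum. The optional packages (matplotlib, wandb) are left out on purpose.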

Install

git clone https://github.com/chandar-lab/EpiK-Eval.git
cd EpiK-Eval
make
pip install --no-deps .

Run the benchmark

To benchmark a model on unsegmented stories:

bash configs/<model>/run_unsegmented.sh

or on segmented stories:

bash configs/<model>/run_segmented.sh
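
To see which values of <model> are available, one can scan the configs/ directory for the two launcher scripts named above. This is a minimal sketch that assumes only the configs/<model>/run_*.sh layout shown in the commands; the function name list_model_configs is hypothetical and not part of the repository.

```python
from pathlib import Path

# The two launcher scripts named in the commands above.
LAUNCHERS = ("run_unsegmented.sh", "run_segmented.sh")


def list_model_configs(configs_dir="configs"):
    """Map each model directory name to the launcher scripts it contains."""
    root = Path(configs_dir)
    if not root.is_dir():
        return {}
    found = {}
    for model_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        scripts = [name for name in LAUNCHERS if (model_dir / name).is_file()]
        if scripts:
            found[model_dir.name] = scripts
    return found


if __name__ == "__main__":
    # Run from the repository root so that "configs" resolves correctly.
    for model, scripts in list_model_configs().items():
        print(model, ", ".join(scripts))
```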

Dataset Privacy Notice

Our dataset is distributed in encrypted form and is decrypted only during installation. To keep this data from leaking into training corpora, we urge all users to take care not to publish or push the unencrypted dataset online, even inadvertently. As a precaution, the unencrypted dataset is listed in the .gitignore file. Your cooperation in keeping our dataset private is greatly appreciated.

Manually Decrypt Dataset

Our dataset is automatically decrypted during installation. If, for whatever reason, one wants to decrypt the dataset manually (for example, without installing our repo), one can run:

make

For more information on the format of our dataset, see data/README.md.

Additional Scripts

We provide the following three additional scripts. Each script is meant to be run from the scripts/ directory, e.g., python generate_paper_plots.py.

Dataset Generation

To generate a new dataset, one can run:

python generate_dataset.py

The default parameters generate a dataset with the same format as the one we provide. Run python generate_dataset.py --help for details on the available options.

Compute Benchmark Metrics

Once a model has been benchmarked, the various metrics shown in our paper can be computed via:

python compute_paper_metrics.py --model_answer_log=logs/example_log.csv
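
Before passing a log to the metrics script, it can be useful to confirm that the file is a readable, non-empty CSV. The sketch below is an assumption-laden illustration: the column layout of the answer log is not described here, so it only reports whatever header the file contains and how many data rows follow. The function name summarize_log is hypothetical.

```python
import csv


def summarize_log(path):
    """Return (column_names, number_of_data_rows) for a CSV log file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])        # first row, or [] for an empty file
        n_rows = sum(1 for _ in reader)  # remaining rows are data
    return header, n_rows


if __name__ == "__main__":
    # Path taken from the command above; adjust to the log you produced.
    columns, n_rows = summarize_log("logs/example_log.csv")
    print(f"{n_rows} rows, columns: {columns}")
```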

Generate Paper Plots

To recreate the plots shown in our paper, one can run:

python generate_paper_plots.py

These figures are already provided in figures/.
