
TRAM: Benchmarking Temporal Reasoning for Large Language Models

This repository contains the datasets, data processing code, model descriptions, and a datasheet for the benchmark introduced in 'TRAM: Benchmarking Temporal Reasoning for Large Language Models'.

Datasets

TRAM encompasses ten temporal reasoning tasks, presented as multiple-choice questions (MCQs) across a range of time-related domains. For clarity, we ensure that each question has exactly one correct answer. TRAM draws on existing natural language understanding datasets, human-crafted templates and questions, web sources, and program generation, with answers derived through a combination of expert annotation and programmatic generation. The benchmark includes 526,668 problems in total. For each dataset, we provide a few-shot development set with 5 questions per category and a separate test set for evaluation. All datasets used in our experiments are available in the /datasets folder. Overview of the ten tasks included in the benchmark:

[Table: overview of the ten TRAM tasks, with data sizes, answer formats, metrics, and text sources]

[1] Zhou et al., 2019, [2] Rajpurkar et al., 2016, [3] Uzzaman et al., 2013, [4] Williams et al., 2018, [5] Bowman et al., 2015, [6] Roemmele et al., 2011, [7] Mostafazadeh et al., 2016, [8] Mostafazadeh et al., 2017

Note: The "Data Size" column aggregates totals from both the development and test sets. "K-Way MC" signifies a multiple-choice response format with K options. "Amb. Res." denotes Ambiguity Resolution. "NLI" stands for natural language inference. "Same" indicates the text source is the same as the row above.

For more details, please refer to the paper.
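As a quick illustration, the snippet below sketches how one of the MCQ files might be loaded and turned into a prompt. The file name (`frequency_test.csv`) and the column names (`Question`, `Option A`–`Option D`, `Answer`) are assumptions made for this example; please inspect the files in /datasets for the actual schema.

```python
# Minimal sketch: load one TRAM MCQ file and format a question as a prompt.
# NOTE: the file name and column names are assumptions for illustration only;
# check the files in /datasets for the actual schema.
import pandas as pd

df = pd.read_csv("datasets/frequency_test.csv")  # hypothetical file name

def format_mcq(row):
    """Turn one dataset row into a multiple-choice prompt string."""
    options = [row[c] for c in df.columns if c.startswith("Option")]
    letters = "ABCD"
    lines = [f"Question: {row['Question']}"]
    lines += [f"({letters[i]}) {opt}" for i, opt in enumerate(options)]
    lines.append("Answer:")
    return "\n".join(lines)

print(format_mcq(df.iloc[0]))
```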

Models

We evaluate the performance of several well-known language models on the TRAM benchmark. The evaluation is organized into two main categories. In the first category, we consider four popular large language models (LLMs): the open-source Llama-2-13b-chat and the closed-source PaLM-bison-chat, GPT-3.5-turbo, and GPT-4. We evaluate each model with two prompting strategies: standard prompting (SP) and chain-of-thought (CoT) prompting. Under both strategies, the models are tested in zero-shot and 5-shot settings. For all models, we apply greedy decoding (i.e., temperature = 0) for response generation, and each model is accessed through its corresponding API key.
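For reference, below is a minimal zero-shot standard-prompting sketch against the OpenAI chat API with greedy decoding (temperature = 0). The prompt template and the way the answer is read off are illustrative assumptions, not the exact evaluation harness used in the paper.

```python
# Minimal sketch: zero-shot standard prompting with greedy decoding (temperature=0).
# The prompt template and answer handling are illustrative, not the paper's exact harness.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(question: str, options: list[str]) -> str:
    letters = "ABCD"
    prompt = (
        question + "\n"
        + "\n".join(f"({letters[i]}) {opt}" for i, opt in enumerate(options))
        + "\nAnswer with the letter of the correct option."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # greedy decoding
    )
    return response.choices[0].message.content.strip()

print(ask("Which event typically lasts longer?",
          ["A football match", "A handshake", "A sneeze", "A blink"]))
```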

In the second category, we adopt minimal supervision, as opposed to traditional fully supervised learning, to establish baseline evaluations. Specifically, we employ four representative BERT-style models: BERT-base, BERT-large, RoBERTa-base, and RoBERTa-large. For the temporal NLI task, we use the sequence classification variants of BERT and RoBERTa from Hugging Face (i.e., BertForSequenceClassification and RobertaForSequenceClassification), which suit the task's structure. For all other tasks, we use the multiple choice variants (i.e., BertForMultipleChoice and RobertaForMultipleChoice).
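Below is a minimal sketch of scoring a single MCQ with the multiple choice head. The example question and the base checkpoint are placeholders, and the fine-tuning under minimal supervision is omitted; the snippet only shows how the inputs are shaped for BertForMultipleChoice.

```python
# Minimal sketch: scoring one MCQ with BertForMultipleChoice (no fine-tuning shown).
import torch
from transformers import BertTokenizer, BertForMultipleChoice

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMultipleChoice.from_pretrained("bert-base-uncased")
model.eval()

question = "How long does it usually take to boil an egg?"  # placeholder example
options = ["A few minutes", "Several weeks", "About a year", "One second"]

# Pair the question with each candidate answer.
encoding = tokenizer(
    [question] * len(options),
    options,
    padding=True,
    truncation=True,
    return_tensors="pt",
)
# BertForMultipleChoice expects tensors of shape (batch_size, num_choices, seq_len).
inputs = {k: v.unsqueeze(0) for k, v in encoding.items()}

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_choices)

print("Predicted option:", options[logits.argmax(dim=-1).item()])
```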
