
Code and dataset "ZEST" from "Learning from task descriptions", Weller et al, EMNLP 2020



Learning from Task Descriptions

Learn NLP tasks from descriptions.

This repository holds companion code for the EMNLP 2020 paper: "Learning from Task Descriptions".

Jump to the most relevant section:

  • Data: Obtain the dataset: ZEST.
  • Evaluation: Evaluate a model's predictions on ZEST.
  • Crowdsourcing: Run ZEST's crowdsourcing templates.
  • Modeling: Train and predict with models on ZEST.
  • Tests: Run the repository's tests and other checks.
  • Citation: Cite this work.
  • Contact: Contact us.


Data

You can download the data as a single file that contains the train, validation, and unlabeled test sets. To evaluate on the test set, submit your predictions to the leaderboard.

The dataset is newline separated JSON. Each line has a JSON object representing a task. Each task's JSON has a question key offering the task description and an examples key providing examples for that task. Train and validation have labels for each example, while test does not. For example:

    {
        "question": "After leaving office, where did this president go to retire?",
        "examples": [
            {
                "context": "Dwight David 'Ike' Eisenhower .....",
                "answer": "n/a"
            },
            ... more contexts and answers here ...
        ]
    }

The answer can be a single string as above, or a list of strings when different annotators gave more than one valid answer. Models may only submit one answer to each (question, context) pair, which the evaluation script considers correct if it is among the valid answers. If the predictions file contains multiple answers for a pair, the evaluation script randomly chooses one of them.
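As a minimal sketch of reading this format (not the repository's own loader), the snippet below parses newline-separated JSON and normalizes each example's answer to a list of valid answers. The helper names `load_tasks` and `valid_answers` are hypothetical, and the sketch assumes that unlabeled test examples simply lack an answer key.

```python
import json
from typing import Any, Dict, Iterable, List


def load_tasks(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Parse newline-separated JSON, one task object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def valid_answers(example: Dict[str, Any]) -> List[Any]:
    """Return the example's valid answers as a list.

    Assumption: unlabeled (test) examples have no "answer" key.
    """
    if "answer" not in example:
        return []
    answer = example["answer"]
    return answer if isinstance(answer, list) else [answer]


# A tiny in-memory stand-in for one line of the dataset file.
sample = [
    json.dumps({
        "question": "After leaving office, where did this president go to retire?",
        "examples": [
            {"context": "Dwight David 'Ike' Eisenhower .....", "answer": "n/a"},
        ],
    })
]

tasks = load_tasks(sample)
for task in tasks:
    for example in task["examples"]:
        print(task["question"], "->", valid_answers(example))
```

To read from disk instead, pass an open file handle to `load_tasks`, since iterating over a file yields one line at a time.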

For the structure question type, the answer has type List[Dict[str, Union[str, List[str]]]]. For example:

{
"question": "What rock types are there at this national park and what era did they form?",
"examples": [
    {"context": "... The younger rocks of sedimentary origin formed during the Paleozoic Era...",
     "answer": [{"rock_types": "sedimentary ", "era_formed": "Paleozoic"}]},
    ...
]
}

Each of the values in the key-value answers may contain one str or List[str] in the case of multiple correct answers, e.g. [{"rock_types": "granitic", "era_formed": ["Mesozoic", "Paleozoic ", "Paleozoic and Mesozoic"]}]. Each value is scored in the same way as the simple answers in the non-structure tasks.
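One way to handle the mixed str / List[str] values is to wrap every value in a list before comparing predictions against it. The sketch below does only that normalization; `normalize_structured` is a hypothetical helper, and this is not the scoring logic of the official evaluation script.

```python
from typing import Dict, List, Union

StructuredAnswer = List[Dict[str, Union[str, List[str]]]]


def normalize_structured(answer: StructuredAnswer) -> List[Dict[str, List[str]]]:
    """Wrap every value in a list so each key maps to its valid strings."""
    return [
        {
            key: value if isinstance(value, list) else [value]
            for key, value in entry.items()
        }
        for entry in answer
    ]


answer = [
    {
        "rock_types": "granitic",
        "era_formed": ["Mesozoic", "Paleozoic ", "Paleozoic and Mesozoic"],
    }
]
normalized = normalize_structured(answer)
print(normalized)
```

After normalization, every key can be treated uniformly as a list of acceptable strings.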


Evaluation

To evaluate predictions on ZEST, use the evaluation script:

python bin/                  \
  --predictions_path <your_predictions_file> \
  --dev-path <path_to_dev_data>              \
  --output-path <output_file>

Your predictions file should contain one prediction per line. It can be plain text with predictions separated by newline characters, a JSON Lines file, or a one-column CSV file.

The results will be written to <output_file> and printed to stdout.
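For the plain-text variant, a predictions file can be produced with a few lines of Python. This is a sketch under assumptions rather than part of the repository: `write_predictions` is a hypothetical helper, it assumes predictions appear in the same order as the examples in the data file, and it replaces embedded newlines with spaces so that each prediction stays on one line.

```python
import os
import tempfile
from typing import Iterable


def write_predictions(predictions: Iterable[str], path: str) -> None:
    """Write one prediction per line.

    Assumption: embedded newlines are replaced with spaces so that each
    prediction occupies exactly one line.
    """
    with open(path, "w", encoding="utf-8") as f:
        for prediction in predictions:
            f.write(" ".join(prediction.splitlines()) + "\n")


# Write two placeholder predictions to a temporary file and read them back.
path = os.path.join(tempfile.mkdtemp(), "predictions.txt")
write_predictions(["n/a", "second\nprediction"], path)
with open(path, encoding="utf-8") as f:
    written = f.read().splitlines()
print(written)
```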


Crowdsourcing

To create ZEST, we crowdsourced tasks (questions) and examples for them from Mechanical Turk. The crowdsourcing templates we used reside in mturk-templates/. Task generation templates reside in mturk-templates/tasks/, while labeling templates reside in mturk-templates/labels/.

Running a HIT

To run one of our crowdsourcing templates on Mechanical Turk, use amti.

First, install amti:

$ pip install git+

Then, launch the batch on Mechanical Turk:

$ amti create-batch \
    mturk-templates/$TEMPLATE_TYPE/$TEMPLATE_NAME/definition \

You can also preview the HIT by running a local webserver with amti preview-batch, e.g.

$ amti preview-batch \
    mturk-templates/$TEMPLATE_TYPE/$TEMPLATE_NAME/definition \

Annotation Pipeline

We used the base/ template to generate questions, which were then fed into the paraphrase/, semantics-flips/, combination/, and output-structure/ templates to create the additional question types. Tasks were labeled based on whether they called for classification, extraction, or structured output. Each type has a separate labeling template in mturk-templates/labels/.


Modeling

We evaluated two baseline models: T5 and BART. Each model has its own environment and run process, described below.



T5

The T5 install requires Python 3.6 or above.

First, install the project's dependencies:


Next, make sure you have the following environment variables set:

  1. ZEST_DATASET_DIR: The directory containing the zest dataset.
  2. ZEST_PREPROCESSED_DATASET_DIR: The directory containing the preprocessed zest dataset.
  3. ZEST_TFDS_DATASETS_DIR: The directory for storing the TFDS (tensorflow datasets) datasets.

Training requires TPUs. For training, all directories must be paths into Google Storage buckets, and you'll also need the following environment variables:

  1. PROJECT: Your Google Cloud project's ID.
  2. ZONE: The zone in which your virtual machine is located.
  3. TPU_NAME: The name of your TPU.
  4. TPU_TOPOLOGY: The topology of the TPU.
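Before running any of the scripts, it can help to confirm that these variables are actually set. The check below is a hypothetical convenience, not something the repository provides; only the variable names come from the lists above.

```python
import os
from typing import List, Mapping

# Variable names taken from the lists above.
DATASET_VARS = [
    "ZEST_DATASET_DIR",
    "ZEST_PREPROCESSED_DATASET_DIR",
    "ZEST_TFDS_DATASETS_DIR",
]
TPU_VARS = ["PROJECT", "ZONE", "TPU_NAME", "TPU_TOPOLOGY"]


def missing_vars(names: List[str], environ: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in the given environment."""
    return [name for name in names if not environ.get(name)]


missing = missing_vars(DATASET_VARS + TPU_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

If you only plan to preprocess the data, checking `DATASET_VARS` alone is enough, since the TPU variables are needed only for training.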

Then, preprocess the zest data, using the script:

$ ./bin/ --help
Usage: [OPTIONS]

  Preprocess the zest dataset for training T5.

  --src TEXT  The source directory from which to read the zest dataset.
              Defaults to the ZEST_DATASET_DIR environment variable.
  --dst TEXT  The destination directory to which to write the preprocessed
              dataset. Defaults to the ZEST_PREPROCESSED_DATASET_DIR
              environment variable.
  --help      Show this message and exit.

Finally, verify your installation:


Training and Evaluation

To train T5, use ./bin/ For example:

./bin/                \
  "zest"                      \
  --pretrained-model "11B"        \
  --n-steps "25000"               \
  --learning-rate "1e-3"          \
  --batch-size "32"               \
  --model-parallelism "16"        \
  --save-checkpoints-steps "2500" \
  --n-checkpoints-to-keep "10"    \
  --tpu-name "${TPU_NAME}"        \
  --tpu-topology "8x16"

The script is self-documenting, so use the --help option for detailed information.

To run prediction with T5, use ./bin/ For example:

./bin/                \
  "zest"                    \
  --split "validation"          \
  --batch-size "32"             \
  --model-parallelism "16"      \
  --tpu-name "${TPU_NAME}"      \
  --tpu-topology "8x16"

To evaluate the predictions, follow the instructions in the Evaluation section. The script, ./bin/, is also self-documenting.



BART

Run ./bin/ This creates a conda env zest_bart with the code and dependencies.

Training and Evaluation

To run and evaluate the BART baselines, run ./bin/ /path/to/zest/data 5e-5 15. This command first trains BART for 15 epochs with learning rate 5e-5, writes out the predictions on the development set to a file, and then uses the evaluation script to calculate the official metrics.

For hardware, we ran the training and evaluation on an RTX 8000 GPU with 48GB of RAM. If your GPU has less memory, you may need to decrease the number of beams for decoding or the sequence lengths (see the arguments eval_beams, val_max_target_length, and eval_max_gen_length to bin/).


Tests

The code is formatted with black. You can run the formatter using the bin/format script:

$ ./bin/format

To run code quality checks, use the bin/verify script:

$ ./bin/verify

For fine-grained control of which tests to run, use pytest directly:

$ pytest

You can also skip slower tests by passing the --skip-slow (-s) flag:

$ pytest --skip-slow


Citation

If you build off this code, data, or work, please cite the paper as follows:

    title = "Learning from Task Descriptions",
    author = "Weller, Orion  and
      Lourie, Nicholas  and
      Gardner, Matt  and
      Peters, Matthew",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "",
    pages = "1361--1375",
    abstract = "Typically, machine learning systems solve new tasks by training on thousands of examples. In contrast, humans can solve new tasks by reading some instructions, with perhaps an example or two. To take a step toward closing this gap, we introduce a framework for developing NLP systems that solve new tasks after reading their descriptions, synthesizing prior work in this area. We instantiate this framework with a new English language dataset, ZEST, structured for task-oriented evaluation on unseen tasks. Formulating task descriptions as questions, we ensure each is general enough to apply to many possible inputs, thus comprehensively evaluating a model{'}s ability to solve each task. Moreover, the dataset{'}s structure tests specific types of systematic generalization. We find that the state-of-the-art T5 model achieves a score of 12{\%} on ZEST, leaving a significant challenge for NLP researchers.",


Contact

For public, non-sensitive matters: please file an issue on this repository.

For private or sensitive inquiries, please contact the authors of the paper directly.

