From Dataset Recycling to Multi-Property Extraction and Beyond
This repository contains model implementations and data from the paper From Dataset Recycling to Multi-Property Extraction and Beyond by Tomasz Dwojak, Michał Pietruszka, Łukasz Borchmann, Jakub Chłędowski, and Filip Graliński.
WikiReading Recycled and the original WikiReading are based on the same data (articles from Wikipedia combined with Wikidata), yet differ in how the data are arranged. The dataset is available at: https://applica-public.s3-eu-west-1.amazonaws.com/multi-property-extraction/wikireading-recycled.tar.
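For convenience, the tarball can be fetched and unpacked with a few lines of Python; this is only a sketch, and the target directory name is an arbitrary choice, not part of the repository layout:

```python
import os
import tarfile
import urllib.request

# URL from this README.
URL = ("https://applica-public.s3-eu-west-1.amazonaws.com/"
       "multi-property-extraction/wikireading-recycled.tar")

def archive_name(url):
    # Local file name derived from the last path segment of the URL.
    return url.rsplit("/", 1)[-1]

def fetch_dataset(url=URL, target_dir="data"):
    # Download the tarball (if not already present) and unpack it.
    # "data" is an illustrative directory name.
    name = archive_name(url)
    if not os.path.exists(name):
        urllib.request.urlretrieve(url, name)
    with tarfile.open(name) as tar:
        tar.extractall(target_dir)
```

Note that the archive is large, so downloading it once and reusing the local copy is advisable.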
Instances from the original WikiReading dataset were merged to produce over 4M samples in the Multi-Property Extraction (MPE) paradigm. In MPE, the system is expected to return values for multiple properties at once. Hence, MPE can be considered a generalization of single-property extraction, since any single-property task can easily be reformulated as an MPE task.
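To make the paradigm concrete, the snippet below sketches what an MPE instance looks like; the field names and the example article are purely illustrative and do not reflect the dataset's actual schema:

```python
# Hypothetical shape of an MPE instance: one article paired with
# several queried properties, each of which may have multiple values.
instance = {
    "article": "Frédéric Chopin was a Polish composer and pianist ...",
    "properties": {
        "country of citizenship": ["Poland"],
        "occupation": ["composer", "pianist"],
        "instrument": ["piano"],
    },
}

def expected_output(instance):
    # In MPE, the system must return values for all queried
    # properties at once, not one property per query.
    return instance["properties"]
```

A single-property task is recovered by querying one property at a time, which is why MPE generalizes it.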
Human-annotated test set
The quality of test sets plays a pivotal role in reasoning about a system's performance. Therefore, a group of annotators went through the test set instances and assessed whether the value either appeared in the article or could be inferred from it. To make further analysis possible, we provide both datasets, before (test-A) and after (test-B) annotation.
Moreover, we determined auxiliary validation subsets with specific qualities to help improve data analysis and provide additional information at different stages of developing a system. Please refer to the paper for a detailed description.
Characteristics of different systems can be compared qualitatively by evaluating them on these subsets. For instance, the long articles subset is challenging for systems that consume truncated inputs, whereas unseen is constructed precisely to assess a system's ability to extract previously unseen properties. On the other hand, rare can be viewed as an approximation of a system's performance on a lower-resource downstream extraction task.
Instead of performing a random split, we carefully divided the data so that 20% of the properties appear solely in the test set (more precisely, are not seen in the train and validation sets). Around one thousand articles containing properties not seen in the remaining subsets were drafted to achieve this objective. Similarly, properties unique to the validation set were introduced to enable approximation of test-set performance without disclosing particular labels.
Additionally, the test and validation sets share 10% of the properties that do not appear in the train set, increasing each of these subsets by 2,000 articles. Another 2,000 articles containing the same properties as the train set were added to each of the validation and test sets. All the remaining articles were used to produce the training set.
To sum up, we arrived at a design where as much as 50% of the properties cannot be seen in the training split, while the remaining 50% may appear in any split. We chose these properties carefully so that neither the test nor the validation set exceeds 5,000 articles.
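The property-level part of the scheme above can be sketched as follows. This is a simplified illustration, not the paper's actual procedure (which also balances article counts); the validation-only share of 20% is inferred from the stated totals (20% test-only + 10% shared between test and validation + 20% validation-only = 50% unseen in training):

```python
import random

def split_properties(properties, seed=0):
    # Sketch of the property split described above.
    rng = random.Random(seed)
    props = list(properties)
    rng.shuffle(props)
    n = len(props)
    test_only = set(props[: n // 5])                   # 20%: only in test
    val_only = set(props[n // 5 : 2 * n // 5])         # 20%: only in validation
    shared_unseen = set(props[2 * n // 5 : n // 2])    # 10%: test+validation, not train
    everywhere = set(props[n // 2 :])                  # 50%: may appear in any split
    return test_only, val_only, shared_unseen, everywhere
```

Articles are then assigned to splits according to which group their properties fall into.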
We made the following models on WikiReading Recycled publicly available:
The Fairseq implementation of these models is available in the fairseq_modules directory.
Reproduction of the Results
We prepared a short tutorial on how to reproduce the results from the paper.