This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Adding DatasetReader for the bAbI tasks and Qangaroo #2194

Merged
merged 18 commits on Dec 18, 2018
18 commits
bcd0586
Adding DatasetReader for the bAbI tasks
Dec 17, 2018
2f008e8
Adding DatasetReader for Qangaroo dataset
nicola-decao Dec 17, 2018
5cffc0f
Fixing issues for pull request of bAbI and Qangaroo dataset readers
nicola-decao Dec 18, 2018
68ee97d
Added test for bAbI dataset reader and modified types
nicola-decao Dec 18, 2018
0a74a68
Fixing test for bAbI dataset reader and modified types
nicola-decao Dec 18, 2018
bc96086
Adding test for Qangaroo dataset reader and fixing issues on bAbI reader
nicola-decao Dec 18, 2018
95f8377
Fixing test for Qangaroo dataset reader and fixing issues on Qangaroo…
nicola-decao Dec 18, 2018
d6e5cc2
Fixing test for Qangaroo dataset reader and fixing typing bugs on Qan…
nicola-decao Dec 18, 2018
cc29b45
Fixing test for Qangaroo dataset reader and bAbI
nicola-decao Dec 18, 2018
6370699
Fixing test for Qangaroo dataset reader and bAbI
nicola-decao Dec 18, 2018
57c906a
Fixing typing of bAbI reader
nicola-decao Dec 18, 2018
62ac78f
Trying to fix mismatch of bAbI reader context
nicola-decao Dec 18, 2018
c52a50b
Adding pylint: disable=arguments-differ to bAbI/Qangaroo readers
nicola-decao Dec 18, 2018
897a0c9
Merge branch 'master' into master
nicola-decao Dec 18, 2018
85a8c0c
Style correction, documentation and minor edits to bAbI/Qangaroo readers
nicola-decao Dec 18, 2018
889d004
Merge branch 'master' of github.com:nicola-decao/allennlp
nicola-decao Dec 18, 2018
a4ade1e
answer_idx -> answer_index in testing Qangaroo reader
nicola-decao Dec 18, 2018
12469d6
Added documentation of bAbI/Qangaroo readers
nicola-decao Dec 18, 2018
3 changes: 2 additions & 1 deletion allennlp/data/dataset_readers/__init__.py
@@ -17,7 +17,7 @@
from allennlp.data.dataset_readers.language_modeling import LanguageModelingReader
from allennlp.data.dataset_readers.multiprocess_dataset_reader import MultiprocessDatasetReader
from allennlp.data.dataset_readers.penn_tree_bank import PennTreeBankConstituencySpanDatasetReader
from allennlp.data.dataset_readers.reading_comprehension import SquadReader, TriviaQaReader, QuACReader
from allennlp.data.dataset_readers.reading_comprehension import SquadReader, TriviaQaReader, QuACReader, QangarooReader
from allennlp.data.dataset_readers.semantic_role_labeling import SrlReader
from allennlp.data.dataset_readers.semantic_dependency_parsing import SemanticDependenciesDatasetReader
from allennlp.data.dataset_readers.seq2seq import Seq2SeqDatasetReader
@@ -31,3 +31,4 @@
WikiTablesDatasetReader, AtisDatasetReader, NlvrDatasetReader, TemplateText2SqlDatasetReader)
from allennlp.data.dataset_readers.semantic_parsing.quarel import QuarelDatasetReader
from allennlp.data.dataset_readers.simple_language_modeling import SimpleLanguageModelingDatasetReader
from allennlp.data.dataset_readers.babi import BabiReader
96 changes: 96 additions & 0 deletions allennlp/data/dataset_readers/babi.py
@@ -0,0 +1,96 @@
import logging

from typing import Dict, List
from overrides import overrides

from allennlp.common.file_utils import cached_path
from allennlp.data.dataset_readers.dataset_reader import DatasetReader
from allennlp.data.instance import Instance
from allennlp.data.fields import Field, TextField, ListField, IndexField
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer
from allennlp.data.tokenizers import Token

logger = logging.getLogger(__name__) # pylint: disable=invalid-name


@DatasetReader.register("babi")
class BabiReader(DatasetReader):
"""
Reads one single task in the bAbI tasks format as formulated in
Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks
(https://arxiv.org/abs/1502.05698). Since this class handle a single file,
if one wants to load multiple tasks together it has to merge them into a
single file and use this reader.

nicola-decao marked this conversation as resolved.
Show resolved Hide resolved
Parameters
----------
keep_sentences: ``bool``, optional, (default = ``False``)
Whether to keep each sentence in the context or to concatenate them.
Default is ``False`` that corresponds to concatenation.
token_indexers : ``Dict[str, TokenIndexer]``, optional (default=``{"tokens": SingleIdTokenIndexer()}``)
We use this to define the input representation for the text. See :class:`TokenIndexer`.
lazy : ``bool``, optional, (default = ``False``)
Whether or not instances can be consumed lazily.
"""
    def __init__(self,
                 keep_sentences: bool = False,
                 token_indexers: Dict[str, TokenIndexer] = None,
                 lazy: bool = False) -> None:

        super().__init__(lazy)
        self._keep_sentences = keep_sentences
        self._token_indexers = token_indexers or {'tokens': SingleIdTokenIndexer()}

    @overrides
    def _read(self, file_path: str):
        # if `file_path` is a URL, redirect to the cache
        file_path = cached_path(file_path)

        logger.info("Reading file at %s", file_path)

        with open(file_path) as dataset_file:
            dataset = dataset_file.readlines()

        logger.info("Reading the dataset")

        context: List[List[str]] = [[]]
        for line in dataset:
            if '?' in line:
                # Question lines are tab-separated: question, answer, supporting fact ids.
                question_str, answer, supports_str = line.replace('?', ' ?').split('\t')
                question = question_str.split()[1:]
                # Supporting fact ids are 1-indexed in the file; convert to 0-indexed.
                supports = [int(support) - 1 for support in supports_str.split()]

                yield self.text_to_instance(context, question, answer, supports)
            else:
                new_entry = line.replace('.', ' .').split()[1:]

                # A new story starts when the line numbering restarts at 1. Checking for
                # '1 ' (not just '1') avoids resetting on line numbers such as 10 or 11.
                if line.startswith('1 '):
                    context = [new_entry]
                else:
                    context.append(new_entry)

    @overrides
    def text_to_instance(self,  # type: ignore
                         context: List[List[str]],
                         question: List[str],
                         answer: str,
                         supports: List[int]) -> Instance:
        # pylint: disable=arguments-differ
        fields: Dict[str, Field] = {}

        if self._keep_sentences:
            context_field = ListField([TextField([Token(word) for word in line],
                                                 self._token_indexers)
                                       for line in context])
            fields['supports'] = ListField([IndexField(support, context_field)
                                            for support in supports])
        else:
            context_field = TextField([Token(word) for line in context for word in line],
                                      self._token_indexers)

        fields['context'] = context_field
        fields['question'] = TextField([Token(word) for word in question], self._token_indexers)
        fields['answer'] = TextField([Token(answer)], self._token_indexers)

        return Instance(fields)
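
For orientation, a minimal usage sketch (not part of the diff); it assumes the fixture file added later in this PR and a working directory at the repository root. The field contents follow from text_to_instance above and the fixture below.

from allennlp.common.util import ensure_list
from allennlp.data.dataset_readers import BabiReader

# Keep per-sentence context so that the 'supports' field is populated.
reader = BabiReader(keep_sentences=True)
instances = ensure_list(reader.read('allennlp/tests/fixtures/data/babi.txt'))

first = instances[0]
print([t.text for t in first.fields['question'].tokens])     # ['What', 'is', 'Gertrude', 'afraid', 'of', '?']
print([t.text for t in first.fields['answer'].tokens])       # ['sheep']
print([f.sequence_index for f in first.fields['supports']])  # [0, 1], indices into the context sentences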
allennlp/data/dataset_readers/reading_comprehension/__init__.py
@@ -8,3 +8,4 @@
from allennlp.data.dataset_readers.reading_comprehension.squad import SquadReader
from allennlp.data.dataset_readers.reading_comprehension.quac import QuACReader
from allennlp.data.dataset_readers.reading_comprehension.triviaqa import TriviaQaReader
from allennlp.data.dataset_readers.reading_comprehension.qangaroo import QangarooReader
90 changes: 90 additions & 0 deletions allennlp/data/dataset_readers/reading_comprehension/qangaroo.py
@@ -0,0 +1,90 @@
import json
import logging

from typing import Dict, List
from overrides import overrides

from allennlp.common.file_utils import cached_path
from allennlp.data.dataset_readers.dataset_reader import DatasetReader
from allennlp.data.instance import Instance
from allennlp.data.fields import Field, TextField, ListField, MetadataField, IndexField
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer
from allennlp.data.tokenizers import Tokenizer, WordTokenizer

logger = logging.getLogger(__name__) # pylint: disable=invalid-name


@DatasetReader.register("qangaroo")
class QangarooReader(DatasetReader):
"""
Reads a JSON-formatted Qangaroo file and returns a ``Dataset`` where the ``Instances`` have six
fields: ``candidates``, a ``ListField[TextField]``, ``query``, a ``TextField``, ``supports``, a
``ListField[TextField]``, ``answer``, a ``TextField``, and ``answer_index``, a ``IndexField``.
We also add a ``MetadataField`` that stores the instance's ID and annotations if they are present.

Parameters
----------
tokenizer : ``Tokenizer``, optional (default=``WordTokenizer()``)
We use this ``Tokenizer`` for both the question and the passage. See :class:`Tokenizer`.
Default is ```WordTokenizer()``.
token_indexers : ``Dict[str, TokenIndexer]``, optional
We similarly use this for both the question and the passage. See :class:`TokenIndexer`.
Default is ``{"tokens": SingleIdTokenIndexer()}``.
"""
    def __init__(self,
                 tokenizer: Tokenizer = None,
                 token_indexers: Dict[str, TokenIndexer] = None,
                 lazy: bool = False) -> None:

        super().__init__(lazy)
        self._tokenizer = tokenizer or WordTokenizer()
        self._token_indexers = token_indexers or {'tokens': SingleIdTokenIndexer()}

    @overrides
    def _read(self, file_path: str):
        # if `file_path` is a URL, redirect to the cache
        file_path = cached_path(file_path)

        logger.info("Reading file at %s", file_path)

        with open(file_path) as dataset_file:
            dataset = json.load(dataset_file)

        logger.info("Reading the dataset")
        for sample in dataset:
            instance = self.text_to_instance(sample['candidates'], sample['query'], sample['supports'],
                                             sample['id'], sample['answer'],
                                             sample.get('annotations', [[]]))
            yield instance

    @overrides
    def text_to_instance(self,  # type: ignore
                         candidates: List[str],
                         query: str,
                         supports: List[str],
                         _id: str = None,
                         answer: str = None,
                         annotations: List[List[str]] = None) -> Instance:
        # pylint: disable=arguments-differ
        fields: Dict[str, Field] = {}

        candidates_field = ListField([TextField(candidate, self._token_indexers)
                                      for candidate in self._tokenizer.batch_tokenize(candidates)])

        fields['query'] = TextField(self._tokenizer.tokenize(query), self._token_indexers)

        fields['supports'] = ListField([TextField(support, self._token_indexers)
                                        for support in self._tokenizer.batch_tokenize(supports)])

        fields['answer'] = TextField(self._tokenizer.tokenize(answer), self._token_indexers)

        fields['answer_index'] = IndexField(candidates.index(answer), candidates_field)

        fields['candidates'] = candidates_field

        fields['metadata'] = MetadataField({'annotations': annotations, 'id': _id})

        return Instance(fields)
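
A minimal sketch (not part of the diff) of calling text_to_instance directly. The query and answer mirror the test fixture assertions below; the candidate list, support passage, and id are shortened, hypothetical stand-ins.

from allennlp.data.dataset_readers import QangarooReader

reader = QangarooReader()
instance = reader.text_to_instance(
        candidates=['german confederation', 'german empire'],            # hypothetical, shortened candidate set
        query='country sms braunschweig',
        supports=['The North German Confederation was a federation.'],   # hypothetical support text
        _id='WH_train_0',                                                # hypothetical id
        answer='german empire')

# The answer's position in `candidates` becomes the IndexField value.
assert instance.fields['answer_index'].sequence_index == 1
print([t.text for t in instance.fields['query'].tokens])  # ['country', 'sms', 'braunschweig']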
27 changes: 27 additions & 0 deletions allennlp/tests/data/dataset_readers/babi_reader_test.py
@@ -0,0 +1,27 @@
# pylint: disable=no-self-use,invalid-name
import pytest

from allennlp.common import Params
from allennlp.common.util import ensure_list
from allennlp.data.dataset_readers import BabiReader
from allennlp.common.testing import AllenNlpTestCase


class TestBAbIReader:
    @pytest.mark.parametrize('keep_sentences, lazy', [(False, False), (False, True),
                                                      (True, False), (True, True)])
    def test_read_from_file(self, keep_sentences, lazy):
        reader = BabiReader(keep_sentences=keep_sentences, lazy=lazy)
        instances = ensure_list(reader.read(AllenNlpTestCase.FIXTURES_ROOT / 'data' / 'babi.txt'))
        assert len(instances) == 8

        if keep_sentences:
            assert [t.text for t in instances[0].fields['context'][3].tokens[3:]] == ['of', 'wolves', '.']
            assert [t.sequence_index for t in instances[0].fields['supports']] == [0, 1]
        else:
            assert [t.text for t in instances[0].fields['context'].tokens[7:9]] == ['afraid', 'of']

    def test_can_build_from_params(self):
        reader = BabiReader.from_params(Params({'keep_sentences': True}))
        # pylint: disable=protected-access
        assert reader._keep_sentences
        assert reader._token_indexers['tokens'].__class__.__name__ == 'SingleIdTokenIndexer'
26 changes: 26 additions & 0 deletions allennlp/tests/data/dataset_readers/qangaroo_reader_test.py
@@ -0,0 +1,26 @@
# pylint: disable=no-self-use,invalid-name
import pytest

from allennlp.common import Params
from allennlp.common.util import ensure_list
from allennlp.data.dataset_readers import QangarooReader
from allennlp.common.testing import AllenNlpTestCase


class TestQangarooReader:
    @pytest.mark.parametrize('lazy', (True, False))
    def test_read_from_file(self, lazy):
        reader = QangarooReader(lazy=lazy)
        instances = ensure_list(reader.read(AllenNlpTestCase.FIXTURES_ROOT / 'data' / 'qangaroo.json'))
        assert len(instances) == 2

        assert [t.text for t in instances[0].fields['candidates'][3]] == ['german', 'confederation']
        assert [t.text for t in instances[0].fields['query']] == ['country', 'sms', 'braunschweig']
        assert [t.text for t in instances[0].fields['supports'][0][:3]] == ['The', 'North', 'German']
        assert [t.text for t in instances[0].fields['answer']] == ['german', 'empire']
        assert instances[0].fields['answer_index'].sequence_index == 4

    def test_can_build_from_params(self):
        reader = QangarooReader.from_params(Params({}))
        # pylint: disable=protected-access
        assert reader._token_indexers['tokens'].__class__.__name__ == 'SingleIdTokenIndexer'
24 changes: 24 additions & 0 deletions allennlp/tests/fixtures/data/babi.txt
@@ -0,0 +1,24 @@
1 Gertrude is a cat.
2 Cats are afraid of sheep.
3 Jessica is a sheep.
4 Mice are afraid of wolves.
5 Emily is a wolf.
6 Winona is a mouse.
7 Wolves are afraid of sheep.
8 Sheep are afraid of wolves.
9 What is Gertrude afraid of? sheep 1 2
10 What is Winona afraid of? wolf 4 6
11 What is Emily afraid of? sheep 5 7
12 What is Jessica afraid of? wolf 3 8
1 Mice are afraid of wolves.
2 Gertrude is a mouse.
3 Sheep are afraid of mice.
4 Winona is a cat.
5 Wolves are afraid of mice.
6 Emily is a sheep.
7 Jessica is a wolf.
8 Cats are afraid of mice.
9 What is Emily afraid of? mouse 3 6
10 What is Winona afraid of? mouse 4 8
11 What is Gertrude afraid of? wolf 1 2
12 What is Jessica afraid of? mouse 5 7
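
As a worked example (a sketch, not part of the diff) of how _read parses this fixture: fields on a question line are tab-separated in the actual file even though they render as spaces here, and the trailing supporting-fact ids are 1-indexed line numbers that the reader converts to 0-indexed sentence indices.

# First question line of the fixture (tab-separated in the real file).
line = '9 What is Gertrude afraid of?\tsheep\t1 2\n'

question_str, answer, supports_str = line.replace('?', ' ?').split('\t')
question = question_str.split()[1:]  # drop the line number: ['What', 'is', 'Gertrude', 'afraid', 'of', '?']
supports = [int(s) - 1 for s in supports_str.split()]  # 1-indexed -> 0-indexed

assert answer == 'sheep'
assert supports == [0, 1]  # matches the 'supports' assertion in the bAbI reader test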
1 change: 1 addition & 0 deletions allennlp/tests/fixtures/data/qangaroo.json

Large diffs are not rendered by default.

7 changes: 7 additions & 0 deletions doc/api/allennlp.data.dataset_readers.babi.rst
@@ -0,0 +1,7 @@
allennlp.data.dataset_readers.babi
==================================

.. automodule:: allennlp.data.dataset_readers.babi
:members:
:undoc-members:
:show-inheritance:
doc/api/allennlp.data.dataset_readers.reading_comprehension.rst
@@ -20,6 +20,11 @@ allennlp.data.dataset_readers.reading_comprehension
:members:
:undoc-members:
:show-inheritance:

.. automodule:: allennlp.data.dataset_readers.reading_comprehension.qangaroo
:members:
:undoc-members:
:show-inheritance:

.. automodule:: allennlp.data.dataset_readers.reading_comprehension.util
:members:
1 change: 1 addition & 0 deletions doc/api/allennlp.data.dataset_readers.rst
@@ -10,6 +10,7 @@ allennlp.data.dataset_readers

allennlp.data.dataset_readers.dataset_reader
allennlp.data.dataset_readers.dataset_utils
allennlp.data.dataset_readers.babi
allennlp.data.dataset_readers.ccgbank
allennlp.data.dataset_readers.conll2000
allennlp.data.dataset_readers.conll2003