WTQ dataset reader and models moved from iterative-search-semparse #2764
Conversation
@matt-gardner This PR is finally ready to be reviewed!
This is awesome @pdasigi! It looks like it was a whole lot of work, thanks for doing this. I had a few questions, but it looks very good overall.
@@ -32,28 +32,19 @@
class WikiTablesDatasetReader(DatasetReader):
    """
    This ``DatasetReader`` takes WikiTableQuestions ``*.examples`` files and converts them into
    ``Instances`` suitable for use with the ``WikiTablesSemanticParser``. This reader also accepts
    pre-processed JSONL files produced by ``scripts/preprocess_wikitables.py``. Processing the
We put this caching into the base DatasetReader recently, which is (I assume) why you just removed this code; maybe also because the corenlp preprocessing is already done, and you're reading those by default? Have you tried using the caching functionality with this at all, or is it already fast enough that it doesn't matter?
I was not aware of the caching functionality. That sounds great! I removed this because I thought reading in the preprocessed files was not too slow.
target_values_field = MetadataField(target_values)
world = WikiTablesLanguage(table_context)
world_field = MetadataField(world)
# Note: Not passing any feature extractors when instantiating the field below. This will make
You don't want this to be configurable anymore?
# pylint: enable=protected-access
instance = self._dataset_reader.text_to_instance(question_text,  # type: ignore
                                                 table_rows,
                                                 tokenized_question=tokenized_question)
                                                 table_rows)
Does this need additional processing? Does this have to be formatted as a corenlp-tagged table?
It is just the raw table content. I added a comment here and updated the docstring of `text_to_instance` to make that clear.
allennlp/models/semantic_parsing/wikitables/wikitables_erm_semantic_parser.py
logical_form = world.get_logical_form(action_strings)
lisp_string = state.extras[batch_index]
if self._executor.evaluate_logical_form(logical_form, lisp_string):
logical_form = world.action_sequence_to_logical_form(action_strings)
We can execute action sequences directly, which is definitely recommended here, so you don't have to go back and forth between strings. It'll make things at least a little faster.
Done. Thanks! Added a method `WikiTablesLanguage.evaluate_action_sequence`.
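The round-trip being avoided here can be illustrated with a toy domain language. `execute_actions` and `evaluate_action_sequence` below are illustrative stand-ins for the real `WikiTablesLanguage` methods, not their actual code; the point is that the action sequence is evaluated directly, with no intermediate logical-form string to render and re-parse.

```python
# Toy domain language: action sequences build arithmetic expressions in
# postfix order. Executing the sequence directly skips the string round-trip.

def execute_actions(actions):
    """Evaluate a postfix action sequence directly, without first
    rendering it as a logical-form string."""
    stack = []
    for action in actions:
        if action.startswith("num:"):
            stack.append(int(action.split(":")[1]))
        elif action == "op:add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif action == "op:mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"Unknown action: {action}")
    if len(stack) != 1:
        # Analogous to an ExecutionError on an incomplete sequence.
        raise ValueError("Incomplete action sequence")
    return stack[0]

def evaluate_action_sequence(actions, target):
    """Sketch of a denotation check in the spirit of the added method:
    execute directly and compare against the target value."""
    try:
        return execute_actions(actions) == target
    except ValueError:
        return False
```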
action_strings = [action_mapping[(i, action_index)] for action_index in action_indices]
has_logical_form = False
try:
    logical_form = world[i].action_sequence_to_logical_form(action_strings)
Here, too, just execute the action sequence directly, instead of converting to a logical form first. If we need to add a `world.evaluate_action_sequence(actions, targets)`, that's fine. It'll simplify the logic below, also, as you don't need to worry about a `ParsingError`.
But we're also measuring whether the parser is producing valid logical forms (`has_logical_form`), for which we need to make the logical forms. The value of that metric is usually 1.0, but it's nice to have that additional check.
That should already be guaranteed by construction - we only allow actions that produce valid logical forms. I don't think it's worth slowing down training to test something that should have been tested in a unit test.
We'll also get `ExecutionErrors` if something was up with the action sequence, which is another place you could catch this. This would catch two kinds of issues, which are the only ones that we should run into, anyway: (1) cases where we were loose in our type system, but have stricter execution requirements, like around lists vs. single items; and (2) cases where the action sequence didn't terminate, but we're trying to evaluate it anyway (keeping unfinished sequences at the end of beam search).
I was referring to action sequences that got terminated early, hence producing incomplete logical forms. Do you have an estimate of how much slowdown this could cause?
No, but anything you do on the CPU is time that's not utilizing the GPU for training. And those action sequences will be caught with execution errors anyway, so there's no reason to go to a logical form.
Or you could just instruct the beam search to not include unfinished sequences, which is probably the right thing to do, anyway...
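Dropping unfinished sequences at the end of beam search can be sketched like this. This is a toy version of the final step of a state-machine beam search, not allennlp's actual `BeamSearch`; the state representation and the flag name are assumptions for illustration.

```python
def finalize_beam(states, beam_size, keep_final_unfinished_states=False):
    """Toy final step of a state-machine beam search: keep the
    top-scoring states, optionally dropping states whose action
    sequence never terminated (and so cannot be executed)."""
    if not keep_final_unfinished_states:
        states = [s for s in states if s["finished"]]
    return sorted(states, key=lambda s: s["score"], reverse=True)[:beam_size]
```

With the flag off, every sequence that survives to evaluation is guaranteed to be complete, so the incomplete-logical-form case above never arises.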
Okay, removing the `has_logical_form` metric.
Hmm... one more thing. We do need to show the logical form in the demo, and to be able to convert the action sequence into a logical form, we need the corresponding world objects. I think this is the best place to make a logical form and store it in the output dict. Also, this method is only called during validation, so it should not significantly slow down training.
Hmm, yeah, fair enough, that's a good point. You could pass the world object in the output dictionary, and create the logical form in `decode`, but it's probably fine to just put it here during validation.
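The alternative suggested here (pass the world through the output dictionary and render the logical form in the post-processing hook) can be sketched as follows. This is a toy `decode`; the real model's field names and world API may differ.

```python
class ToyWorld:
    """Stand-in for a world object that can render action sequences."""
    def action_sequence_to_logical_form(self, actions):
        return "(" + " ".join(actions) + ")"

def decode(output_dict):
    """Toy post-processing hook: turn the stored action sequences into
    human-readable logical forms for the demo, once per forward pass,
    using the world objects carried in the output dict."""
    worlds = output_dict["world"]
    sequences = output_dict["best_action_sequence"]
    output_dict["logical_form"] = [
        world.action_sequence_to_logical_form(actions)
        for world, actions in zip(worlds, sequences)
    ]
    return output_dict
```

Either placement works; doing it in the validation path (as merged) keeps the conversion out of the training loop just as effectively.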
@@ -163,17 +163,14 @@ def get_table_knowledge_graph(self) -> KnowledgeGraph:
    self._table_knowledge_graph = KnowledgeGraph(entities, dict(neighbors), entity_text)
    return self._table_knowledge_graph

# TODO (pradeep): Make a ``read_from_json`` method similar to what we had in ``TableQuestionKnowledgeGraph``
Do we need that for a demo?
I don't think so. Removing that TODO.
lines: List[List[str]]) -> Tuple[List[Dict[str, Dict[str, str]]],
                                 Dict[str, Set[str]]]:
"""
This method will be called only when we do not have tagged information from CoreNLP. That is, when we are
Is it possible to just call corenlp's python API here instead of this? Or would that require a call to java, which we're trying to avoid? It probably would...
I don't think we're missing out on a lot by not using CoreNLP here.
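What a CoreNLP-free fallback amounts to can be sketched as simple string normalization plus number extraction from the raw cells. This is illustrative only; the real method's normalization and date handling are more involved, and these function names are assumptions.

```python
import re

def normalize_cell(text):
    """Lowercase a raw cell and collapse non-alphanumeric runs into
    underscores, roughly how table entities get canonical names."""
    normalized = re.sub(r"[^\w]+", "_", text.strip().lower()).strip("_")
    return normalized or "none"

def extract_number(text):
    """Return the first number in a raw cell, if any, so that numeric
    comparisons still work without CoreNLP's NER tags."""
    match = re.search(r"-?\d+(\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None
```

This covers most of what the tagged files provide for WTQ tables, which is presumably why not using CoreNLP here costs little.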
@@ -320,6 +320,21 @@ def get_agenda(self,
    agenda.append(f"Number -> {number}")
    return agenda

@staticmethod
def is_instance_specific_entity(entity_name: str) -> bool:
Hmm, seems like we should have a function for this on the base class, but maybe it doesn't fit... I'll think about this some more.
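For reference, the kind of check being discussed can be sketched like this. The entity-name prefixes below are assumptions based on how table-derived entities are typically named in this codebase, not necessarily the exact set the method uses.

```python
def is_instance_specific_entity(entity_name: str) -> bool:
    """Toy check: entities derived from a particular table (column names
    and cell strings) are instance-specific, while constants defined by
    the language itself (e.g. 'all_rows') are shared across instances."""
    instance_prefixes = ("string:", "string_column:", "number_column:", "date_column:")
    return entity_name.startswith(instance_prefixes)
```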
@pdasigi would like to train a model with this code before merging this PR.
Force-pushed from 146983f to 457d8da
I spent a lot of time trying to reproduce results from the best model we got using the code in iterative-search-semparse (reported in this paper). I made some progress in getting there, but it's not fully done. I think the biggest discrepancy still remaining is around generic types (like …). I may find other minor differences after I fix these two issues.

That said, I think reproducing the result is beyond the scope of this PR. Given that this PR is already huge, and it has important changes (like removing the JDK dependency), I've decided to merge this now and open another PR with changes that ensure we get the same result. Any objections @matt-gardner?
Force-pushed from cbfc8be to 49cac30
Is the result close?
@matt-gardner Update: I trained an MML model, followed by an ERM model initialized from it using this code. The dev accuracies are both only 1 point behind those from comparable models using the old code. I think that's close enough, and I'm going to merge this PR (after resolving conflicts). I'll spend some time running experiments to try out the things we discussed offline yesterday, and if they give better results, I'll open a new PR.
Force-pushed from 49cac30 to 8a6a12a
(…llenai#2764)

* added dataset reader
* moved dataset reader
* moved models
* removed unnecessary except
* fixed the name of the model in docs
* change trainer_test
* minor fixes
* removed unnecessary modules
* removed unnecessary tests
* removed table knowledge graph
* removed table knowledge graph
* updated knowledge graph field test with new context
* fixed docs after removing table knowledge graph
* got the predictor tests passing
* fixed a bug in checking if data is tagged
* minor fixes
* addressed PR comments and added a test for reading untagged tables
* fixed a silly doc issue
* removed jdk dependency
* more language changes
* undid changes made for backwards compatibility
* pylint and mypy fixes
* better error raising
Fixes #2691
Most of the tests are passing locally. I'll merge this after they all pass and after I evaluate the trained models with the new code and get the same numbers. Also, I suppose there are a bunch of things that can now be deleted (type declarations, SEMPRE-related stuff...). I'll add those deletions to this PR as well.