Conversation
…into sql-text-utils
Feels like this could use at least a little more high-level documentation for someone who hasn't looked at Jonathan's repo, but other than that LGTM.
@@ -0,0 +1,137 @@
Remove blank line.
sql_variables: Dict[str, str]

def get_tokens(sentence: List[str],
This replaces things like city_name0 with san francisco, right? Maybe name this replace_variables?
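If it helps the renaming discussion, here is a minimal sketch of what a replace_variables along these lines could look like. The name, signature, and multi-word handling are assumptions for illustration, not the PR's actual code:

```python
from typing import Dict, List

def replace_variables(sentence: List[str],
                      sentence_variables: Dict[str, str]) -> List[str]:
    # Hypothetical sketch: swap placeholders like 'city_name0' for their
    # values ('san francisco'), leaving other tokens untouched.
    tokens = []
    for token in sentence:
        if token in sentence_variables:
            # Values may contain spaces, so split them into tokens.
            tokens.extend(sentence_variables[token].split())
        else:
            tokens.append(token)
    return tokens
```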
""" | ||
sql_tokens = [] | ||
for token in sql.strip().split(): | ||
token = token.replace('"', "").replace('"', "").replace("%", "") |
You're replacing double quotes here twice, which doesn't do anything. Did you mean for one of these to be single quotes?
Thanks!
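For concreteness, a hedged sketch of the suggested fix, assuming the intent was to strip both quote styles as well as percent signs:

```python
def strip_quotes(token: str) -> str:
    # Assumed fix: remove double quotes, single quotes, and percent signs.
    # The original replaced double quotes twice, so the second call was a no-op.
    return token.replace('"', "").replace("'", "").replace("%", "")
```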
def clean_and_split_sql(sql: str) -> List[str]:
    """
    Cleans up and unifies a SQL query. This involves removing uncessary quotes
s/uncessary/unnecessary/
def clean_and_split_sql(sql: str) -> List[str]:
    """
    Cleans up and unifies a SQL query. This involves removing uncessary quotes
    and spliting brackets which aren't formatted consistently in the data.
s/spliting/splitting/
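A small sketch of the bracket-splitting behaviour the docstring describes, as I understand it (assumed implementation, not the PR's exact code):

```python
from typing import List

def split_brackets(sql: str) -> List[str]:
    # Pad parentheses with spaces so '(city)' and '( city )' tokenise
    # identically before whitespace splitting.
    return sql.replace("(", " ( ").replace(")", " ) ").strip().split()
```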
    dataset_bucket = "train"
else:
    dataset_bucket = dataset_split
# Loop over the different sql statements with "equivelent" semantics
s/equivelent/equivalent/
sql_variables=sql_variables)
yield (dataset_bucket, sql_data)
# Some questions might have multiple equivelent SQL statements.
s/equivelent/equivalent/
use_question_split,
cross_validation_split)
for dataset, instance in tagged_example:
    if dataset == dataset_split:
This means you're potentially reading the file multiple times to get the various splits from it?
Yes - I haven't concretely figured out how I'm going to do this yet. I might refactor the data to be separated into splits, rather than having all the splits in a single file, as this would play more nicely with allennlp.
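One way to avoid re-reading the file per split, sketched under the assumption that the generator yields (split, instance) pairs, is to consume it once and bucket the instances:

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple

def bucket_splits(tagged_examples: Iterable[Tuple[str, Any]]) -> Dict[str, List[Any]]:
    # Hypothetical helper: one pass over the generator, grouping instances
    # by split, so every split is available without re-reading the file.
    splits: Dict[str, List[Any]] = defaultdict(list)
    for split, instance in tagged_examples:
        splits[split].append(instance)
    return dict(splits)
```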
# All of these data points are the same. This is weird, but currently correct.
for split, sql_data in dataset:
    # This should test because in the data, the cross validation split is == 1.
"should be test"?
'city_name0', 'AND', 'RESTAURANTalias0.ID', '=', 'LOCATIONalias0.RESTAURANT_ID',
'AND', 'RESTAURANTalias0.NAME', '=', 'name0', ';']
assert sql_data.text_variables == {'city_name0': 'san francisco', 'name0': 'buttercup kitchen'}
assert sql_data.sql_variables == {'city_name0': 'san francisco', 'name0': 'buttercup kitchen'}
Just curious: are the text variables and the sql variables ever different?
good question - this only appears in one of the datasets, "advising", which is extremely difficult (bordering on impossible).
More detail here:
jkkummerfeld/text2sql-data#12
Some utility functions for reading in the text2sql data. These will be improved iteratively (e.g. there is some deduplication that can be done when reading the data), but this version is functional, so I thought I'd iterate on it afterward.
This is abstracted from a specific data reader because there may well be several dataset readers associated with these datasets, for baselines etc.