This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Sql text utils #1717

Merged
merged 14 commits into allenai:master from sql-text-utils on Sep 7, 2018

Conversation

@DeNeutoy (Contributor) commented Sep 6, 2018

Some utility functions for reading in the text2sql data. These are going to be iteratively improved (e.g. there is some deduplication that can be done when reading the data), but this is functional, so I thought I'd improve on it iteratively afterward.

This is abstracted from a specific data reader because there may well be several dataset readers associated with these datasets, for baselines etc.

@matt-gardner (Contributor) left a comment:

Feels like this could use at least a little more high-level documentation for someone who hasn't looked at Jonathan's repo, but other than that LGTM.

@@ -0,0 +1,137 @@

Contributor:

Remove blank line.

sql_variables: Dict[str, str]


def get_tokens(sentence: List[str],
Contributor:

This replaces things like city_name0 with san francisco, right? Maybe name this replace_variables?
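For readers without Jonathan's repo open, a minimal sketch of what a replace_variables function along these lines might do (the name follows the reviewer's suggestion; splitting multi-word values into separate tokens is an assumption, not the merged code):

```python
from typing import Dict, List

def replace_variables(sentence: List[str],
                      sentence_variables: Dict[str, str]) -> List[str]:
    # Substitute each variable placeholder (e.g. "city_name0") with its
    # value (e.g. "san francisco"), leaving other tokens unchanged.
    tokens: List[str] = []
    for token in sentence:
        if token not in sentence_variables:
            tokens.append(token)
        else:
            # Multi-word values become multiple tokens.
            tokens.extend(sentence_variables[token].split())
    return tokens
```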

"""
sql_tokens = []
for token in sql.strip().split():
token = token.replace('"', "").replace('"', "").replace("%", "")
Contributor:

You're replacing double quotes here twice, which doesn't do anything. Did you mean for one of these to be single quotes?

Contributor (Author):

Thanks!


def clean_and_split_sql(sql: str) -> List[str]:
"""
Cleans up and unifies a SQL query. This involves removing uncessary quotes
Contributor:

s/uncessary/unnecessary/

def clean_and_split_sql(sql: str) -> List[str]:
"""
Cleans up and unifies a SQL query. This involves removing uncessary quotes
and spliting brackets which aren't formatted consistently in the data.
Contributor:

s/spliting/splitting/
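A rough sketch of what a function matching this docstring might look like, based only on the snippets quoted in this review, not the merged code (the parenthesis handling is an assumption about what "splitting brackets" means here):

```python
from typing import List

def clean_and_split_sql(sql: str) -> List[str]:
    # Tokenize a SQL string: strip stray quotes and percent signs, and
    # separate an opening parenthesis glued to the preceding token
    # (e.g. "COUNT(" -> "COUNT", "(").
    sql_tokens: List[str] = []
    for token in sql.strip().split():
        token = token.replace('"', "").replace("'", "").replace("%", "")
        if token.endswith("(") and len(token) > 1:
            sql_tokens.append(token[:-1])
            sql_tokens.append("(")
        else:
            sql_tokens.append(token)
    return sql_tokens
```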

dataset_bucket = "train"
else:
dataset_bucket = dataset_split
# Loop over the different sql statements with "equivelent" semantics
Contributor:

s/equivelent/equivalent/

sql_variables=sql_variables)
yield (dataset_bucket, sql_data)

# Some questions might have multiple equivelent SQL statements.
Contributor:

s/equivelent/equivalent/

use_question_split,
cross_validation_split)
for dataset, instance in tagged_example:
if dataset == dataset_split:
Contributor:

This means you're potentially reading the file multiple times to get the various splits from it?

Contributor (Author):

Yes - I haven't concretely figured out how I'm going to do this yet, because I might refactor the data to be separated into splits rather than having all the splits in a single file, which would play more nicely with allennlp.
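To make the trade-off being discussed concrete, here is a toy sketch of the pattern (the tagging function, the "query-split" key, and read_split are hypothetical stand-ins, not the utilities in this PR): a generator tags every example with its split, falling back to "train" like the dataset_bucket logic in the snippet above, and a per-split reader filters the whole stream, so requesting N splits walks the file N times.

```python
from typing import Dict, Iterable, List, Tuple

def tag_examples(examples: List[Dict]) -> Iterable[Tuple[str, Dict]]:
    # Tag each example with its split; unrecognized labels are
    # bucketed into "train".
    for example in examples:
        split = example.get("query-split", "train")
        if split not in ("train", "dev", "test"):
            split = "train"
        yield split, example

def read_split(examples: List[Dict], dataset_split: str) -> List[Dict]:
    # Re-walks the full tagged stream for every split requested --
    # the inefficiency flagged in the review.
    return [ex for split, ex in tag_examples(examples) if split == dataset_split]
```

Splitting the data into one file per split, as suggested in the reply, would remove the repeated pass entirely.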


# All of these data points are the same. This is weird, but currently correct.
for split, sql_data in dataset:
# This should test because in the data, the cross validation split is == 1.
Contributor:

"should be test"?

'city_name0', 'AND', 'RESTAURANTalias0.ID', '=', 'LOCATIONalias0.RESTAURANT_ID',
'AND', 'RESTAURANTalias0.NAME', '=', 'name0', ';']
assert sql_data.text_variables == {'city_name0': 'san francisco', 'name0': 'buttercup kitchen'}
assert sql_data.sql_variables == {'city_name0': 'san francisco', 'name0': 'buttercup kitchen'}
Contributor:

Just curious: are the text variables and the sql variables ever different?

Contributor (Author):

Good question - this only appears in one of the datasets, "advising", which is extremely difficult (bordering on impossible).
More detail here:
jkkummerfeld/text2sql-data#12

@DeNeutoy DeNeutoy merged commit 72c9e98 into allenai:master Sep 7, 2018
@DeNeutoy DeNeutoy deleted the sql-text-utils branch September 7, 2018 21:43