Conversation
…into sql-text-utils
Feels like this could use at least a little more high-level documentation for someone who hasn't looked at Jonathan's repo, but other than that LGTM.
@@ -0,0 +1,137 @@
Remove blank line.
sql_variables: Dict[str, str]

def get_tokens(sentence: List[str],
This replaces things like city_name0 with san francisco, right? Maybe name this replace_variables?
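If it helps the renaming discussion, here is a minimal sketch of what a replace_variables along these lines could look like. The name, signature, and multi-word handling are assumptions for illustration, not the PR's actual code:

```python
from typing import Dict, List

def replace_variables(sentence: List[str],
                      sentence_variables: Dict[str, str]) -> List[str]:
    # Hypothetical sketch: swap placeholders like 'city_name0' for their
    # values ('san francisco'), leaving other tokens untouched.
    tokens = []
    for token in sentence:
        if token in sentence_variables:
            # Values may contain spaces, so split them into tokens.
            tokens.extend(sentence_variables[token].split())
        else:
            tokens.append(token)
    return tokens
```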
""" | ||
sql_tokens = [] | ||
for token in sql.strip().split(): | ||
token = token.replace('"', "").replace('"', "").replace("%", "") |
You're replacing double quotes here twice, which doesn't do anything. Did you mean for one of these to be single quotes?
Thanks!
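For concreteness, a hedged sketch of the suggested fix, assuming the intent was to strip both quote styles as well as percent signs:

```python
def strip_quotes(token: str) -> str:
    # Assumed fix: remove double quotes, single quotes, and percent signs.
    # The original replaced double quotes twice, so the second call was a no-op.
    return token.replace('"', "").replace("'", "").replace("%", "")
```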
def clean_and_split_sql(sql: str) -> List[str]:
    """
    Cleans up and unifies a SQL query. This involves removing uncessary quotes
s/uncessary/unnecessary/
def clean_and_split_sql(sql: str) -> List[str]:
    """
    Cleans up and unifies a SQL query. This involves removing uncessary quotes
    and spliting brackets which aren't formatted consistently in the data.
s/spliting/splitting/
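A small sketch of the bracket-splitting behaviour the docstring describes, as I understand it (assumed implementation, not the PR's exact code):

```python
from typing import List

def split_brackets(sql: str) -> List[str]:
    # Pad parentheses with spaces so '(city)' and '( city )' tokenise
    # identically before whitespace splitting.
    return sql.replace("(", " ( ").replace(")", " ) ").strip().split()
```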
    dataset_bucket = "train"
else:
    dataset_bucket = dataset_split
# Loop over the different sql statements with "equivelent" semantics
s/equivelent/equivalent/
sql_variables=sql_variables)
yield (dataset_bucket, sql_data)
# Some questions might have multiple equivelent SQL statements.
s/equivelent/equivalent/
use_question_split,
cross_validation_split)
for dataset, instance in tagged_example:
    if dataset == dataset_split:
This means you're potentially reading the file multiple times to get the various splits from it?
Yes - I haven't concretely figured out how I'm going to do this yet. I might refactor the data to be separated into splits, rather than having all the splits in a single file, as this would play more nicely with allennlp.
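One way to avoid re-reading the file per split, sketched under the assumption that the generator yields (split, instance) pairs, is to consume it once and bucket the instances:

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple

def bucket_splits(tagged_examples: Iterable[Tuple[str, Any]]) -> Dict[str, List[Any]]:
    # Hypothetical helper: one pass over the generator, grouping instances
    # by split, so every split is available without re-reading the file.
    splits: Dict[str, List[Any]] = defaultdict(list)
    for split, instance in tagged_examples:
        splits[split].append(instance)
    return dict(splits)
```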
# All of these data points are the same. This is weird, but currently correct.
for split, sql_data in dataset:
    # This should test because in the data, the cross validation split is == 1.
"should be test"?
'city_name0', 'AND', 'RESTAURANTalias0.ID', '=', 'LOCATIONalias0.RESTAURANT_ID',
'AND', 'RESTAURANTalias0.NAME', '=', 'name0', ';']
assert sql_data.text_variables == {'city_name0': 'san francisco', 'name0': 'buttercup kitchen'}
assert sql_data.sql_variables == {'city_name0': 'san francisco', 'name0': 'buttercup kitchen'}
Just curious: are the text variables and the sql variables ever different?
good question - this only appears in one of the datasets, "advising", which is extremely difficult (bordering on impossible).
More detail here:
jkkummerfeld/text2sql-data#12
Some utility functions for reading in the text2sql data. These will be improved iteratively (e.g. there is some deduplication that can be done when reading the data), but this version is functional, so I thought I'd iterate on it afterward.
This is abstracted from a specific data reader because there may well be several dataset readers associated with these datasets, for baselines etc.