LIKE clause #67

Closed
wants to merge 5 commits into from
2 changes: 1 addition & 1 deletion README.md
@@ -93,7 +93,7 @@ Right now, the focus is on building a command-line tool that follows these core
SELECT [ DISTINCT | PARTIALS ]
[ * | python_expression [ AS output_column_name ] [, ...] ]
[ FROM csv | spy | text | python_expression | json [ EXPLODE path ] ]
- [ WHERE python_expression ]
+ [ WHERE python_expression [ [NOT] LIKE string] ]
Owner

  1. I wouldn't change the description of the query structure. Let's assume we are adding a Python operator (which therefore fits into a python_expression). We would then highlight this in the documentation.
  2. I would broaden the use of LIKE to any python_expression. Example of LIKE used outside of the WHERE clause: SELECT 'error' if msg like 'error%' else 'OK'
  3. I would also consider adding ILIKE, which behaves like LIKE but is case-insensitive.

[ GROUP BY output_column_number | python_expression [, ...] ]
[ ORDER BY output_column_number | python_expression
[ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] ]
41 changes: 39 additions & 2 deletions spyql/cli.py
@@ -207,6 +207,42 @@ def parse_select(sel, strings):
    return res, has_distinct, has_partials


def parse_wherelike(clause, strings):
    """splits the LIKE clause and completely supports the SQL syntax
    https://docs.microsoft.com/en-us/sql/t-sql/language-elements/like-transact-sql?view=sql-server-ver15"""
    # We're not in a LIKE expression, do nothing
    if not re.search("LIKE", clause):
        return clause

    # Supports words containing [a-zA-Z0-9_\-]
    expr_pattern = re.compile(r"([\w-]+)(?:\s+(NOT))?\s+LIKE\s+([\w-]+)", re.IGNORECASE)
Owner

I think that this regex does not do the trick. It fails in cases like:

  • col1 + col2 LIKE 'constant%'
  • 'constant%' LIKE col1
  • func(col1) LIKE col1 + '%'

I think we have one of the following options:

  • do not implement a LIKE operator, but simply make available a function like(a, b). Instead of col1 LIKE 'constant%' we would write like(col1, 'constant%')
  • detect occurrences of LIKE and replace them in the query string with something like col1 | like_op | 'constant%'. like_op would be an instance of a class that overloads the | (bitwise or) operator and does the LIKE magic: given 2 strings, it parses them to detect '%' and '_', does the comparison and returns a boolean. This would be a little more trouble and requires particular attention to operator precedence and a complete test suite. We would need 4 operators: LIKE, NOT LIKE, ILIKE, NOT ILIKE. Look here: https://stackoverflow.com/a/56739916/9522280

I am fine with both approaches. I would be happy to have the first in the short term and the second in the longer term.
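
A minimal sketch of how the two options could fit together, assuming a helper named like() and a hypothetical Infix wrapper class (neither exists in this PR, and the linked Stack Overflow answer may differ in detail). The pattern translation here escapes every literal character but, to stay short, does not handle backslash-escaped wildcards.

```python
import re


def like(value, pattern, ignore_case=False):
    """Option 1: plain function. '%' matches any run of characters, '_' exactly one."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    flags = re.DOTALL | (re.IGNORECASE if ignore_case else 0)
    return re.fullmatch(regex, str(value), flags) is not None


class Infix:
    """Option 2 (hypothetical helper): lets a 2-argument function be used as `a | op | b`."""

    def __init__(self, func):
        self.func = func

    def __ror__(self, left):
        # Handles `left | op`: remember the left operand and wait for the right one.
        return Infix(lambda right: self.func(left, right))

    def __or__(self, right):
        # Handles `(left | op) | right`: apply the stored function.
        return self.func(right)


# The four operators mentioned in the comment (names are assumptions).
like_op = Infix(lambda a, b: like(a, b))
not_like_op = Infix(lambda a, b: not like(a, b))
ilike_op = Infix(lambda a, b: like(a, b, ignore_case=True))
not_ilike_op = Infix(lambda a, b: not like(a, b, ignore_case=True))

print(like("constant value", "constant%"))               # True
print("const" + "ant value" | like_op | "constant%")     # True
print("ERROR: disk" | like_op | "error%")                # False
print("ERROR: disk" | ilike_op | "error%")               # True
print("abc" | not_like_op | "a_c")                       # False
```

On precedence: + binds tighter than |, so col1 + col2 | like_op | 'constant%' groups as (col1 + col2) | like_op | 'constant%'. Comparison operators and not/and/or bind more loosely than |, so an expression such as a == b | like_op | c compares a against the boolean result, which is worth covering in the test suite.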

    groups = re.search(expr_pattern, clause)
    if groups is None:
        spyql.log.user_error(
            f"{clause}",
            SyntaxError("unexpected EOF while parsing")
        )

    groups = groups.groups()
    negate = "NOT" in {groups[1]}  # placed within {} because it can be None

    if not groups[2] in strings:
        spyql.log.user_error(
            f"{groups[2]}: missing quotes, must be a string",
            SyntaxError("bad query")
        )

    # Replacing SQL wildcard '%' for regex wildcard '.*' if not preceded by '\'
    pattern = strings.put_strings_back(groups[2])
Owner

Does this mean that we only accept LIKE wildcards on the right side?
I think we can implement a single-sided LIKE because it might be simpler. One option would be to test whether the wildcards are on the left or the right side and swap if needed, raising an error when there are wildcards on both sides.
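
A rough sketch of that swap-or-fail check, assuming it runs on the two operand strings at evaluation time. The names has_wildcard and split_like_operands are hypothetical, and the escape handling is simplified: a wildcard counts as escaped whenever a backslash immediately precedes it.

```python
import re

# An unescaped '%' or '_' (not directly preceded by a backslash) counts as a wildcard.
_WILDCARD = re.compile(r"(?<!\\)[%_]")


def has_wildcard(text):
    return _WILDCARD.search(text) is not None


def split_like_operands(left, right):
    """Return (value, pattern), swapping the operands when only the left has wildcards."""
    left_wild = has_wildcard(left)
    right_wild = has_wildcard(right)
    if left_wild and right_wild:
        raise ValueError("LIKE wildcards found on both sides")
    if left_wild:
        return right, left
    return left, right


print(split_like_operands("constant value", "constant%"))
print(split_like_operands("constant%", "constant value"))
```

One caveat with this design: '_' is common in ordinary values (for example my_file), so a plain value can be mistaken for a pattern, and a row whose value happens to contain '%' would trigger the both-sides error.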

    pattern = re.compile(r"(?<!\\)%").sub(r".*" , pattern)
Owner

nice :-)

Owner

Shouldn't we first escape any regex special characters? For instance, col1 like '123.456.%' would accept '1233456oops' because of the meaning of . in regex patterns...
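
One way this could be handled, sketched under the assumption that the pattern is translated character by character rather than with chained substitutions (the function name like_to_regex is hypothetical). Walking the string avoids the ordering problem of calling re.escape first, which would also escape the backslash that protects an escaped wildcard.

```python
import re


def like_to_regex(pattern):
    """Translate a LIKE pattern to a regex, escaping every literal character."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Backslash escape: the next character is taken literally.
            out.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "%":
            out.append(".*")  # any run of characters
        elif ch == "_":
            out.append(".")   # exactly one character
        else:
            out.append(re.escape(ch))
        i += 1
    return "".join(out)


regex = like_to_regex("123.456.%")
print(re.fullmatch(regex, "123.456.789") is not None)  # True
print(re.fullmatch(regex, "1233456oops") is not None)  # False

escaped = like_to_regex(r"bla\%bla")
print(re.fullmatch(escaped, "bla%bla") is not None)    # True
print(re.fullmatch(escaped, "blaXbla") is not None)    # False
```

Using re.fullmatch makes the explicit ^ and $ anchors unnecessary.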

    pattern = re.compile(r"([^\"].*[^\"])").sub(r"^\1$", pattern)

    clause = "re.match({}, str({}))".format(pattern, groups[0])
    clause = "not " + clause if negate else clause

    return clause


def parse_orderby(clause, strings):
    """splits the ORDER BY clause and handles modifiers"""

@@ -275,9 +311,10 @@ def parse(query):
"order by",
}:
if prs[clause]:
- prs[clause] = make_expr_ready(prs[clause], strings)
if clause in {"where", "from"}:
throw_error_if_has_agg_func(prs[clause], clause.upper())
+ prs[clause] = make_expr_ready(prs[clause], strings)
+ prs[clause] = parse_wherelike(prs[clause], strings)

for clause in {"group by"}:
if prs[clause]:
@@ -400,7 +437,7 @@ def main(query, warning_flag, verbose, unbuffered, input_opt, output_opt):
SELECT [ DISTINCT | PARTIALS ]
[ * | python_expression [ AS output_column_name ] [, ...] ]
[ FROM csv | spy | text | python_expression | json [ EXPLODE path ] ]
- [ WHERE python_expression ]
+ [ WHERE python_expression [ [NOT] LIKE string] ]
[ GROUP BY output_column_number | python_expression [, ...] ]
[ ORDER BY output_column_number | python_expression
[ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] ]
3 changes: 3 additions & 0 deletions spyql/quotes_handler.py
@@ -9,6 +9,9 @@ class QuotesHandler:
    def __init__(self):
        self.strings = {}

    def __iter__(self):
        return iter(self.strings)
Comment on lines +12 to +13
Owner

cool!
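
For context, a small sketch of why this is enough for the `groups[2] in strings` check in parse_wherelike: when a class defines __iter__ but no __contains__, Python evaluates `in` by iterating. The class below is a hypothetical stand-in, and the placeholder key format is made up for illustration.

```python
class QuotesHandlerSketch:
    """Hypothetical stand-in for QuotesHandler; only the parts needed here."""

    def __init__(self):
        # Maps placeholder -> original quoted string (placeholder format is an assumption).
        self.strings = {"_STR_0_": '"test%"'}

    def __iter__(self):
        # No __contains__ is defined, so `x in handler` falls back to
        # iterating over the dictionary keys returned here.
        return iter(self.strings)


handler = QuotesHandlerSketch()
print("_STR_0_" in handler)  # True
print("col1" in handler)     # False
```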


    # replaces quoted strings by placeholders to make parsing easier
    # populates dictionary of placeholders and the strings they hold
    def extract_strings(self, query):
74 changes: 74 additions & 0 deletions tests/main_test.py
@@ -193,6 +193,78 @@ def test_basic():
    )


def test_wherelike():
    base_data = """abc,def
test1,a
test2,a
bla,a
"""

    # where like clause
    eq_test_1row(
        'SELECT * FROM range(3) WHERE col1 LIKE "1"', {"col1": 1}
    )

    # not matching
    eq_test_nrows(
        'SELECT * FROM range(3) WHERE col1 LIKE "5"', []
    )

    # where not like clause
    eq_test_nrows(
        'SELECT * FROM range(3) WHERE col1 NOT LIKE "1"', [{"col1": 0}, {"col1": 2}]
    )

    # non matching string
    eq_test_nrows(
        'SELECT abc FROM csv WHERE abc LIKE "x"',
        [],
        data=base_data
    )

    # matching string
    eq_test_nrows(
        'SELECT abc FROM csv WHERE abc LIKE "test1"',
        [{"abc": "test1"}],
        data=base_data
    )

    # wildcard in end
    eq_test_nrows(
        'SELECT abc FROM csv WHERE abc LIKE "test%"',
        [{"abc": "test1"}, {"abc": "test2"}],
        data=base_data
    )

    # wildcard in start
    eq_test_nrows(
        'SELECT abc FROM csv WHERE abc LIKE "%test"',
        [{"abc": "1test"}, {"abc": "2test"}],
        data=base_data+"1test,a\n2test,a\n"
    )

    # wildcard in start and end
    eq_test_nrows(
        'SELECT abc FROM csv WHERE abc LIKE "%test%"',
        [{"abc": "test1"}, {"abc": "test2"}, {"abc": "1test1"}, {"abc": "2test2"}],
        data=base_data+"1test1,a\n2test2,a\n"
    )

    # wildcard escaping
    eq_test_nrows(
        r'SELECT abc FROM csv WHERE abc LIKE "bla\\%bla"',
        [{"abc": "bla%bla"}],
        data=base_data+"bla%bla,a\n"
    )

    # wildcards only
    eq_test_nrows(
        r'SELECT abc FROM csv WHERE abc LIKE "%\\%%"',
        [{"abc": "bla%bla"}],
        data=base_data+"bla%bla,a\n"
    )


def test_orderby():
    # order by (1 col)
    eq_test_nrows(
@@ -773,6 +845,8 @@ def test_errors():
    exception_test("SELECT DISTINCT count_agg(1)", SyntaxError)
    exception_test("SELECT count_agg(1) GROUP BY 1", SyntaxError)
    exception_test("SELECT 1 FROM range(3) WHERE max_agg(col1) > 0", SyntaxError)
    exception_test("SELECT * from range(3) WHERE col1 LIKE 1", SyntaxError)
    exception_test("SELECT * from range(3) WHERE col1 LIKE", SyntaxError)


def test_sql_output():