feat: helper functions for RLS #19055

betodealmeida · 2022-03-08T02:28:29Z

SUMMARY

This PR introduces 2 helper functions for RLS:

has_table_query(statement: Statement) -> bool, which analyzes a SQL statement parsed by sqlparse and returns True if it queries a table (either in a FROM or a JOIN).
insert_rls(token_list: TokenList, table: str, rls: TokenList) -> TokenList, which, given a parsed statement (or token list), a table name, and a parsed RLS expression, rewrites the query so that the RLS is present in every query or subquery referencing the table.

For example, if we have the RLS expression id=42 on the table my_table we could do:

>>> statement = sqlparse.parse('SELECT COUNT(*) FROM my_table')[0]
>>> has_table_query(statement)
True
>>> rls = sqlparse.parse('id=42')[0]
>>> print(insert_rls(statement, 'my_table', rls))
SELECT COUNT(*) FROM my_table WHERE my_table.id=42

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

N/A

TESTING INSTRUCTIONS

I added unit tests covering different cases:

Regular queries
Nested queries
Joins
Union
Fully qualified table names (eg, schema.table_name)
RLS already in the SQL

ADDITIONAL INFORMATION

Has associated issue:
Required feature flags:
Changes UI
Includes DB Migration (follow approval process in SIP-59)
- Migration is atomic, supports rollback & is backwards-compatible
- Confirm DB migration upgrade and downgrade tested
- Runtime estimates and downtime expectations provided
Introduces new feature or API
Removes existing feature or API

codecov · 2022-03-08T02:53:00Z

Codecov Report

Merging #19055 (00598fa) into master (77063cc) will increase coverage by 0.05%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master   #19055      +/-   ##
==========================================
+ Coverage   66.56%   66.61%   +0.05%     
==========================================
  Files        1641     1641              
  Lines       63495    63583      +88     
  Branches     6425     6425              
==========================================
+ Hits        42265    42358      +93     
+ Misses      19550    19545       -5     
  Partials     1680     1680

Flag	Coverage Δ
hive	`52.53% <11.26%> (-0.07%)`	⬇️
mysql	`81.87% <100.00%> (+0.06%)`	⬆️
postgres	`81.92% <100.00%> (+0.06%)`	⬆️
presto	`52.38% <11.26%> (-0.07%)`	⬇️
python	`82.36% <100.00%> (+0.06%)`	⬆️
sqlite	`81.67% <100.00%> (+0.12%)`	⬆️

Impacted Files	Coverage Δ
superset/sql_parse.py	`98.95% <100.00%> (+0.34%)`	⬆️
superset/viz.py	`58.26% <0.00%> (ø)`
superset/config.py	`91.82% <0.00%> (ø)`
superset/reports/commands/execute.py	`91.51% <0.00%> (ø)`
superset/databases/schemas.py	`98.47% <0.00%> (+<0.01%)`	⬆️
superset/utils/core.py	`90.25% <0.00%> (+0.02%)`	⬆️
superset/commands/exceptions.py	`92.98% <0.00%> (+0.25%)`	⬆️
superset/security/manager.py	`94.72% <0.00%> (+0.31%)`	⬆️
superset/common/query_context_processor.py	`91.46% <0.00%> (+0.47%)`	⬆️
superset/db_engine_specs/databricks.py	`90.90% <0.00%> (+0.90%)`	⬆️
... and 2 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 77063cc...00598fa. Read the comment docs.

john-bodley · 2022-03-08T21:34:58Z

superset/sql_parse.py

@@ -458,3 +460,178 @@ def validate_filter_clause(clause: str) -> None:
                )
    if open_parens > 0:
        raise QueryClauseValidationException("Unclosed parenthesis in filter clause")
+
+
+def has_table_query(statement: Statement) -> bool:


@betodealmeida there's also this example which has logic for identifying tables.

I don't remember the details, but I've had issues with that example code before. I think it failed to identify table names when they were considered keywords (even though the example calls it out).

john-bodley · 2022-03-08T21:40:27Z

superset/sql_parse.py

+    """
+    seen_source = False
+    tokens = statement.tokens[:]
+    while tokens:


You likely could just do for token in stmt.flatten(): and remove the logic from lines 483–485.

.flatten() is a bit different in that it returns the leaf nodes only, converting an identifier into 1+ Name tokens:

>>> list(sqlparse.parse('SELECT * FROM my_table')[0].flatten())
[<DML 'SELECT' at 0x10FF019A0>,
 <Whitespace ' ' at 0x10FF01D00>,
 <Wildcard '*' at 0x10FF01D60>,
 <Whitespace ' ' at 0x10FF01DC0>,
 <Keyword 'FROM' at 0x10FF01E20>,
 <Whitespace ' ' at 0x10FF01E80>,
 <Name 'my_tab...' at 0x10FF01EE0>]

Since I'm looking for identifiers after a FROM or JOIN I thought it was easier to implement a traversal logic that actually inspects the parents, not just the leaves.

john-bodley · 2022-03-08T21:47:34Z

superset/sql_parse.py

+
+        if token.ttype == Keyword and token.value.lower() in ("from", "join"):
+            seen_source = True
+        elif seen_source and (


The challenge here is that there's no strong connection ensuring that the consecutive (or near consecutive) tokens are the ones being identified here. I guess the question is how robust we want this logic to be. The proposed solution may well suffice.

The correct way of doing this is more of a tree traversal (as opposed to a flattened list), where one checks the next token (which could be a group) after the FROM or JOIN keyword and iterates from there.

My sense is that this can likely be addressed later. We probably need to clean up the sqlparse logic, junking it completely in favor of something else, given that the package seems to be somewhat on life support.

Yeah, in the insert_rls function I had to implement tree traversal to get it right. Let me give it a try rewriting this one.

@john-bodley I reimplemented it following the same logic as insert_rls (recursive tree traversal instead of flattening).

john-bodley · 2022-03-08T21:53:51Z

superset/sql_parse.py

+    Modify a RLS expression ensuring columns are fully qualified.
+    """
+    tokens = rls.tokens[:]
+    while tokens:


You could likely use flatten here. It returns a generator, so a copy should be made given that you're mutating the tokens, i.e.,

for token in list(rls.flatten()):
    if imt(token, i=Identifier) and token.get_parent_name() is None:
        ...

Same issue, if we call .flatten() we would never get an Identifier.

suddjian · 2022-03-09T19:28:12Z

superset/sql_parse.py

+                i, Where([Token(Keyword, "WHERE"), Token(Whitespace, " "), rls]),
+            )
+
+            # Right pad with space, if needed


why does sqlparse even tokenize whitespace?

I think it's because it makes it easier to convert the parse tree back to a string. Not sure.

betodealmeida · 2022-03-10T17:45:12Z

@suddjian I modified the logic to always include the RLS even if it's already present, since there are a few corner cases that are hard to identify. For example, if we have the RLS user_id=1 and this query:

SELECT * FROM table
WHERE TRUE OR user_id=1

Even though we already have the token Comparison(user_id=1) in the WHERE clause, we still need to apply the RLS, since in this case the existing comparison is a no-op. So we need to add it:

SELECT * FROM table
WHERE TRUE OR user_id=1 AND user_id=1

More importantly, because AND takes precedence over OR, we need to wrap the original predicate in parentheses:

SELECT * FROM table
WHERE (TRUE OR user_id=1) AND user_id=1

Without parentheses the predicate evaluates as TRUE OR (user_id=1 AND user_id=1), which bypasses the RLS!

I implemented the logic to wrap the original predicate and added tests covering it.
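The precedence issue can be checked against a real engine. The sketch below uses an in-memory SQLite database from the Python standard library; the table my_table, its rows, and the helper user_ids are assumptions made up for the example (the thread uses a table named table, which is a reserved word in SQLite), and the TRUE literal needs SQLite 3.23 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (user_id INTEGER)")
conn.executemany("INSERT INTO my_table VALUES (?)", [(1,), (2,), (3,)])


def user_ids(predicate: str) -> list:
    """Return the user_id values matched by the given WHERE predicate."""
    sql = f"SELECT user_id FROM my_table WHERE {predicate} ORDER BY user_id"
    return [row[0] for row in conn.execute(sql).fetchall()]


# RLS appended without parentheses: AND binds tighter than OR, so the
# predicate is TRUE OR (user_id=1 AND user_id=1) and every row leaks.
unwrapped = user_ids("TRUE OR user_id=1 AND user_id=1")

# Original predicate wrapped first: only the row allowed by the RLS is left.
wrapped = user_ids("(TRUE OR user_id=1) AND user_id=1")

print(unwrapped)
print(wrapped)
```

Only the wrapped form restricts the result to the rows permitted by the RLS, which is the behavior the new tests are meant to pin down.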

villebro

Very nice! A few comments with a potential false positive, but other than that looks really good 👍

betodealmeida · 2022-03-11T20:23:43Z

Addressed all of @villebro's comments.

suddjian

awesome

* feat: helper functions for RLS
* Add function to inject RLS
* Add UNION tests
* Add tests for schema
* Add more tests; cleanup
* has_table_query via tree traversal
* Wrap existing predicate in parenthesis
* Clean up logic
* Improve table matching

(cherry picked from commit 8234395)

betodealmeida added 2 commits March 7, 2022 15:38

feat: helper functions for RLS

eca77d7

Add function to inject RLS

11c2bb3

pull-request-size bot added the size/L label Mar 8, 2022

betodealmeida requested review from villebro, john-bodley and suddjian and removed request for villebro March 8, 2022 02:28

betodealmeida added 2 commits March 7, 2022 19:54

Add UNION tests

6942b9e

Add tests for schema

746a919

john-bodley reviewed Mar 8, 2022

betodealmeida added 2 commits March 8, 2022 14:21

Add more tests; cleanup

5b32f9b

has_table_query via tree traversal

b5bfb94