Distinguish between inner and semijoins in `QueryExpr` AST. #969

gefjon · 2024-03-13T20:40:31Z

Description of Changes

This commit adds a flag `semi: bool` to `JoinExpr`, which signifies a semijoin, as opposed to an inner join.

A new optimization pass, `QueryExpr::try_semi_join`, is defined which can detect a certain common case of inner joins and rewrite them into semijoins.
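As a rough illustration of what such a rewrite can look like, here is a sketch over a toy AST. The `Op`, `Query`, and `try_semi_join` definitions below are hypothetical stand-ins, not the real `QueryExpr` types, and the detected pattern (an inner join followed immediately by a projection onto the left-hand table's columns) is an assumption about what the "common case" might be.

```rust
// Toy AST for illustration only. These are hypothetical stand-ins,
// not the real `QueryExpr` / `JoinExpr` definitions.
#[derive(Debug, Clone, PartialEq)]
enum Op {
    /// Join the current rows against `rhs`.
    /// `semi: true` means only the left-hand columns are produced.
    Join { rhs: String, semi: bool },
    /// Keep only the columns belonging to the named table.
    ProjectTable(String),
    /// Any operator this pass does not inspect.
    Other(String),
}

#[derive(Debug, Clone, PartialEq)]
struct Query {
    source: String,
    ops: Vec<Op>,
}

/// Assumed pattern: an inner join immediately followed by a projection
/// onto the source (left-hand) table. The pair is replaced by a single
/// semijoin, since the projection would discard the right-hand columns anyway.
fn try_semi_join(q: Query) -> Query {
    let Query { source, ops } = q;
    let mut out = Vec::with_capacity(ops.len());
    let mut iter = ops.into_iter().peekable();
    while let Some(op) = iter.next() {
        match op {
            Op::Join { rhs, semi: false }
                if matches!(iter.peek(), Some(Op::ProjectTable(t)) if *t == source) =>
            {
                iter.next(); // drop the projection, which is now redundant
                out.push(Op::Join { rhs, semi: true });
            }
            // Everything else passes through unchanged.
            other => out.push(other),
        }
    }
    Query { source, ops: out }
}
```

In this sketch, a join that is not followed by a projection onto the left-hand table is left as an inner join, so the pass is safe to run on any query.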

The punchline here is that `core::vm::join_inner` used to accept a flag `semi: bool` which it could use to avoid some expensive `Header` mutations, but that flag was always passed as `false` because we had no way to distinguish semijoins. With this commit, the flag is actually used, so evaluating non-indexed semijoins should avoid allocating a new `Header`.
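To make the cost difference concrete, here is a minimal sketch of a nested-loop join over in-memory rows. `Header`, `Row`, `Table`, and this `join_inner` are simplified hypothetical stand-ins, not the real `core::vm` types or signature. The only point is that the semijoin branch can share the left-hand header, while the inner-join branch has to build a combined one.

```rust
use std::rc::Rc;

// Simplified stand-ins for illustration only; the real `Header` and
// `core::vm::join_inner` are more involved.
#[derive(Debug, Clone, PartialEq)]
struct Header {
    columns: Vec<String>,
}

type Row = Vec<i64>;

struct Table {
    header: Rc<Header>,
    rows: Vec<Row>,
}

/// Nested-loop equijoin on `lhs[lhs_col] == rhs[rhs_col]`.
fn join_inner(lhs: &Table, rhs: &Table, lhs_col: usize, rhs_col: usize, semi: bool) -> Table {
    let header = if semi {
        // Semijoin: output rows have the left-hand shape, so the existing
        // header is shared and no new `Header` is allocated.
        Rc::clone(&lhs.header)
    } else {
        // Inner join: the output header is the concatenation of both sides.
        let mut columns = lhs.header.columns.clone();
        columns.extend(rhs.header.columns.iter().cloned());
        Rc::new(Header { columns })
    };

    let mut rows = Vec::new();
    for l in &lhs.rows {
        for r in &rhs.rows {
            if l[lhs_col] == r[rhs_col] {
                let mut row = l.clone();
                if !semi {
                    row.extend_from_slice(r);
                }
                // This sketch emits one row per match in both modes, so the
                // semijoin result equals the inner join projected onto the
                // left-hand columns.
                rows.push(row);
            }
        }
    }
    Table { header, rows }
}
```

Calling `join_inner(&lhs, &rhs, 0, 0, true)` in this sketch returns a table whose header is the same `Rc` as `lhs.header`, whereas passing `false` returns a freshly built header with the columns of both sides.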

API and ABI breaking changes

N/A

Expected complexity level and risk

3 - our query planner and evaluator depend strongly (more strongly than they should) on specific parses and representations in some places. I believe I found all such places and changed them, but am not confident.

joshua-spacetime · 2024-03-14T16:52:44Z

crates/core/src/vm.rs

+        let mut sources = SourceSet::default();
+        let rhs_source_expr = sources.add_mem_table(data);
+
+        let q = query(&schema).with_join_inner(rhs_source_expr, FieldName::positional("inventory", 0), rhs, true);


I think I would prefer a slightly higher-level test that goes through the entire optimizer.

As you pointed out, we already have test coverage of this part, so these are redundant. Feel free to remove.

joshua-spacetime · 2024-03-14T17:01:14Z

crates/vm/src/eval.rs

+        let source_expr = sources.add_mem_table(table.clone());
+        let second_source_expr = sources.add_mem_table(table);
+
+        let q = query(source_expr).with_join_inner(second_source_expr, field.clone(), field, true);


Again, I think I would prefer a higher-level test that goes from SQL to result set. You can use `RelationalDB::create_table_for_test` along with `sql::execute::run`.

crates/vm/src/expr.rs

joshua-spacetime

~~I just want to request one more test case. That we correctly transform an IndexJoin between delta tables to the corresponding semijoin.~~

Correction: we already have test coverage for this. If you can update the other tests, this should be good to merge.

joshua-spacetime · 2024-03-14T18:57:25Z

crates/vm/src/expr.rs

+
+        let q = QueryExpr {
+            source: lhs_source,
+            // Build the query manually, because `.with_select` will attempt to push selections before the join.


This is unfortunate. All of this should be included as part of optimize. But of course this is a preexisting issue.

joshua-spacetime · 2024-03-14T19:02:55Z

But I should add that this change does improve the performance of incremental join exactly as expected.

incr-join               time:   [3.1750 µs 3.1777 µs 3.1807 µs]
                        change: [-27.737% -27.633% -27.519%] (p = 0.00 < 0.05)
                        Performance has improved.

- Remove a test that was silly and backwards, and intentionally thwarted the optimizer in a way that will hopefully stop working soon.
- Add a test that an `IncrementalJoin`'s `virtual_plan` looks like we expect.
- Rename the `JoinExpr` argument to `core::vm::join_inner` for clarity.
- Sprinkle comments around about how we compile and optimize joins.

gefjon requested review from Centril and joshua-spacetime March 13, 2024 20:40

joshua-spacetime reviewed Mar 14, 2024

joshua-spacetime linked an issue Mar 14, 2024 that may be closed by this pull request

Make construction of inner join iterator faster #968

Closed

joshua-spacetime requested changes Mar 14, 2024

joshua-spacetime approved these changes Mar 14, 2024

Merge remote-tracking branch 'origin/master' into phoebe/semijoin-expr

7ea94cf

gefjon enabled auto-merge March 15, 2024 13:50

gefjon added this pull request to the merge queue Mar 15, 2024

Merged via the queue into master with commit 96e5ef1 Mar 15, 2024
6 checks passed

Centril deleted the phoebe/semijoin-expr branch March 18, 2024 10:36
