Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distinguish between inner and semijoins in QueryExpr AST. #969

Merged
merged 3 commits into from
Mar 15, 2024

Conversation

gefjon
Copy link
Contributor

@gefjon gefjon commented Mar 13, 2024

Description of Changes

This commit adds a flag semi: bool to JoinExpr, which signifies a semijoin, as opposed to an inner join.

A new optimization pass, QueryExpr::try_semi_join, is defined which can detect a certain common case of inner joins and rewrite them into semijoins.

The punchline here is that core::vm::join_inner used to accept a flag semi: bool which it could use to avoid some expensive Header mutations, but that flag was always passed as false because we had no way to distinguish semijoins. With this commit, the flag is actually used,
so evaluating non-indexed semijoins should avoid allocating a new Header.

API and ABI breaking changes

N/a

Expected complexity level and risk

3 - our query planner and evaluator depend strongly (more strongly than they should) on specific parses and representations in some places. I believe I found all such places and changed them, but am not confident.

This commit adds a flag `semi: bool` to `JoinExpr`, which signifies a semijoin,
as opposed to an inner join.

A new optimization pass, `QueryExpr::try_semi_join`, is defined
which can detect a certain common case of inner joins and rewrite them into semijoins.

The punchline here is that `core::vm::join_inner` used to accept a flag `semi: bool`
which it could use to avoid some expensive `Header` mutations,
but that flag was always passed as `false` because we had no way to distinguish semijoins.
With this commit, the flag is actually used,
so evaluating non-indexed semijoins should avoid allocating a new `Header`.
crates/core/src/vm.rs Outdated Show resolved Hide resolved
let mut sources = SourceSet::default();
let rhs_source_expr = sources.add_mem_table(data);

let q = query(&schema).with_join_inner(rhs_source_expr, FieldName::positional("inventory", 0), rhs, true);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would prefer a slightly higher level test that goes through the entire optimizer.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you pointed out, we already have test coverage of this part, so these are redundant. Feel free to remove.

let source_expr = sources.add_mem_table(table.clone());
let second_source_expr = sources.add_mem_table(table);

let q = query(source_expr).with_join_inner(second_source_expr, field.clone(), field, true);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I think I would prefer a higher level test that goes from sql to result set. You can use RelationalDB::create_table_for_test along with sql::execute::run.

crates/vm/src/expr.rs Show resolved Hide resolved
crates/vm/src/expr.rs Show resolved Hide resolved
crates/vm/src/expr.rs Show resolved Hide resolved
@joshua-spacetime joshua-spacetime linked an issue Mar 14, 2024 that may be closed by this pull request
Copy link
Collaborator

@joshua-spacetime joshua-spacetime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just want to request one more test case. That we correctly transform an IndexJoin between delta tables to the corresponding semijoin.

Correction: we already have test coverage for this. If you can update the other tests, this should be good to merge.

crates/vm/src/expr.rs Outdated Show resolved Hide resolved

let q = QueryExpr {
source: lhs_source,
// Build the query manually, because `.with_select` will attempt to push selections before the join.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unfortunate. All of this should be included as part of optimize. But of course this is a preexisting issue.

@joshua-spacetime
Copy link
Collaborator

But I should add that this change does improve the performance of incremental join exactly as expected.

incr-join               time:   [3.1750 µs 3.1777 µs 3.1807 µs]
                        change: [-27.737% -27.633% -27.519%] (p = 0.00 < 0.05)
                        Performance has improved.

- Remove a test that was silly and backwards, and intentionally thwarted the optimizer
  in a way that will hopefully stop working soon.
- Add a test that an `IncrementalJoin`'s `virtual_plan` looks like we expect.
- Rename the `JoinExpr` argument to `core::vm::join_inner` for clarity.
- Sprinkle comments around about how we compile and optimize joins.
@gefjon gefjon enabled auto-merge March 15, 2024 13:50
@gefjon gefjon added this pull request to the merge queue Mar 15, 2024
Merged via the queue into master with commit 96e5ef1 Mar 15, 2024
6 checks passed
@Centril Centril deleted the phoebe/semijoin-expr branch March 18, 2024 10:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make construction of inner join iterator faster
2 participants