Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CORE] Enhancement: Gluten advanced cost-based query optimization (advanced CBO) #5057

Closed
zhztheplayer opened this issue Mar 21, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@zhztheplayer
Copy link
Member

zhztheplayer commented Mar 21, 2024

Enhancement: Gluten Advanced Cost-based Query Optimization (Advanced CBO)

Background

Many of developers may already noticed that Apache Spark's Catalyst query planner basically heuristically applies rules without doing a global search for best plan (except for some local cases like join-reorder). This have been bringing a general issue to Gluten: With writing heuristic rules Gluten just gained limited capability on deciding offloading which part of computation to native engine, and users may frequently find the generated plan is sub-optimal and could cause performance issues.

By this enhancement proposal we would like to introduce some techniques derived from the well-known Volcano / Cascades query optimizer theory to Gluten's code base to deal with this. To distinguish with vanilla Spark's CBO (cost-based optimization), we'd name the new thing Advanced CBO (hereinafter also called ACBO). Note, naming it with "advanced" doesn't actually mean it's required to be literally more advanced than existing query optimization strategy in Apache Spark, but should have the potential to be more suitable for native accelerator like Gluten's adoption scenarios.

Design

The rough design of ACBO will follow these principles:

Plan enumeration

ACBO should basically have the capability to do plan enumeration. In the worst case, the search space of a N node plan in Gluten would be at least O(2^N) (assume each node has vanilla / native versions of implementation). So memorized-search should be adopted to make the overall search space smaller.

General purpose core module

Make the search engine of ACBO's optimizer general. It should be able to handle a larger set of generic plan representation than Spark / Gluten plan. This is because what we are doing is not developing an optimizer for a new DBMS but to integrate an existing system into the new optimizer. If the the design is general enough, then we can:

  1. Smoothly know whether a further issue is because of Gluten's code or caused by ACBO during development.
  2. Keep the optimizer kernal away from backend layers. ACBO's search engine will have less compile-time dependencies with having Gluten / Spark's jars excluded. It eases maintenance for long-term.

ACBO will be a new Maven module of gluten. Will use 'gluten-cbo'.

Individually work with Spark's CBO

ACBO could be turn on/off despite whether user turns Spark's CBO on or off. Though we can get better statistics from Spark's CBO and ACBO could leverage that when Spark CBO is also turned on.

As a Spark columnar rule

ACBO will only operate on Spark's physical plan. We can decide whether to enable it in logical plan optimization in future. But that would not be a goal for long while. A smaller code entrance for ACBO would lower the chance of adding maintenance burden, or facing problems caused by Gluten's legacy code issues.

Goal

The following are some goals of the initial version of ACBO:

  1. Deliver a general purpose CBO framework with global plan enumeration and search enabled;
  2. Selectively enable some rules in Gluten's rule list in ACBO; Provide user the option to turn the ACBO on/off.
    The option should be by default off.
  3. If possible, replace the main single-op validation rule (it's now "RegularTransformRule") and C2R / R2C adding rule with ACBO version.

We don't have to expect real performance speed-up in the 1st move, because:

  1. We don't yet decide to deliver a solid cost model at the 1st version of ACBO;
  2. ACBO will initially only cover a few of the original Spark rules.

On the other hand, one of the most important purpose at this period is to integrate ACBO with current optimizer framework "successfully", and gradually hand over optimization workload from current planner to ACBO. Writing ACBO rules will be comparatively easier than writing heuristic rules: developer just tells the optimizer one plan can be replaced by another, and doesn't have to guarantee the replacement provides any profit or not. Thus, removing the RBO rules and migrating it to ACBO will always simplify the code. Thus one of ACBO's bonus is to make Gluten's code more maintainable.

Non-goal

It's worth noting again that we don't plan to replace catalyst's query optimization by ACBO. ACBO would work as one or several individual rules in Spark's query execution procedure.

@zhztheplayer zhztheplayer added the enhancement New feature or request label Mar 21, 2024
@zhztheplayer zhztheplayer changed the title Enhancement: Gluten Advanced Cost-based Query Optimization (Advanced CBO) [CORE] Enhancement: Gluten Advanced Cost-based Query Optimization (Advanced CBO) Mar 21, 2024
@zhztheplayer
Copy link
Member Author

There is a runnable PoC for this. Will submit a PR ASAP.

@zhztheplayer zhztheplayer changed the title [CORE] Enhancement: Gluten Advanced Cost-based Query Optimization (Advanced CBO) [CORE] Enhancement: Gluten advanced cost-based query optimization (advanced CBO) Mar 23, 2024
@zhztheplayer
Copy link
Member Author

Closing since first PR merged at c1e1cca.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant