Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support SQL in Pinot. #4219

Closed
kishoreg opened this issue May 17, 2019 · 17 comments
Closed

Support SQL in Pinot. #4219

kishoreg opened this issue May 17, 2019 · 17 comments
Assignees
Labels

Comments

@kishoreg
Copy link
Member

While existing PQL is almost similar to SQL, it has some nuances that make it difficult for users to query pinot. Our eventual goal should be able to support most of the sql features on a single table (excluding joins). This will allow tools such as superset, grafana, etc to be work seamlessly on top of Pinot.

Currently, adding a new feature requires us to change pql grammar, broker request, and the execution layer. This means there is a high turn around time for each feature.

The right end state would be - "change only the execution layer to add new features - grammar and broker request should be complete and flexible". I am proposing we move to calcite and leverage the full sql support and parser that comes with Calcite.

Proposal

  • Create PinotQuery IDL(thrift)- this should represent the pql/sql. Additional information such as debugoptions should not be part of this IDL
  • Embed PinotQuery inside BrokerRequest.
  • AST Parser to convert PQL to PinotQuery

For backward compatibility, populate old objects in BrokerRequest (selection aggregation groupby etc) from pinot query.

  • Change the execution layer to use BrokerRequest.pinotQuery
  • Introduce another endpoint that takes calcite sql and converts it into brokerrequest

pql 2 calcite

@Jackie-Jiang
Copy link
Contributor

Suggest changing Enhanced Broker Request to Pinot Query to make them totally separate from each other. No need to nest them IMO.

For steps of migration:

  1. Implement PQL -> Pinot Query and Pinot Query -> Broker Request
  2. Support Pinot Query in Execution Layer (probably through a separate endpoint)
  3. Implement SQL -> Pinot Query
  4. (Optional) Remove PQL and Broker Request

Migration Steps

@kishoreg
Copy link
Member Author

kishoreg commented May 18, 2019

@Jackie-Jiang yes, I would love to do this and this was how I started. The problem was that brokerrequest was very widely used in the code and it was very hard to replace it with pinot query in one shot.

That's why we put pinotquery inside brokerrequest which kind of makes sense if you think request consisting of query, debugoptions etc.

One key goal was to never have something like this in the query execution code.

//use broker request
} else {
  //pinot query
}

If we keep old fields (selection, aggInfoList etc) and pinotQuery in the brokerRequest, we can migrate to pinotQuery in steps - operator by operator. For e.g. solve only the selection operator then transform operator and so on.

I agree with you only doubling the size of brokerRequest until we migrate the execution layer to use brokerrequest.pinotquery at which point we can delete the references to Selection, etc. I don't think this will add significant overhead in terms of network or cpu usage. We have the validation flag to test this easily.

Final Structure of BrokerRequest will be as follows

  PinotQuery pinotQuery;
  bool enableTrace;
  Map<String,String> requestOptions;
}```



@Jackie-Jiang
Copy link
Contributor

@kishoreg Short term wise, we can nest PinotQuery into BrokerRequest so we can support the functionalities step by step if the overhead is not an issue. Long term wise, I'll still suggest replacing BrokerRequest with PinotQuery once all the functionalities are supported. enableTrace should be inside the debugOptions, and requestOptions should be part of queryOptions from PinotQuery.

@kishoreg
Copy link
Member Author

@Jackie-Jiang, yes once we migrate the execution layer to work on PinotQuery, we can definitely move over to PinotQuery if needed.

The way I see it is broker takes in a bunch of request params - pql, enableTrace, debugOptions and we will add few more things in the future. The brokerRequest we have today makes sense because it captures everything. The only difference is that while enableTrace and debugOptions have 1:1 mapping in brokerRequest, pql is split into many different fields. Adding PinotQuery will encapsulate all elements of the pql/sql within it.

Anyways, we can revisit this when we refactor the execution layer to use PinotQuery.

@walterddr
Copy link
Contributor

Hi here, I was mostly a user for pinot but pretty new to the development - really glad to see that Pinot is adopting ANSI SQL through Calcite.

One question: based on the design description, seems like only the parser phase of the Calcite project is being considered here (Stringify SQL -> SQLNode); is there any plan to consider adopting more of the Calcite capacity, e.g. the optimizer?

@sunithabeeram
Copy link
Contributor

@kishoreg - I missed seeing this issue earlier. It will be good to see the reasons for going the Calcite route. Based on internal discussions, it appears that leveraging Calcite for just parsing wouldn't yield a lot of benefit and infact might be harder to maintain than the corresponding antlr grammar based one.

Also, it will be good define how close to sql compliance we want to get and how PQL will evolve - will everything have to fit into sql parlance?

@kishoreg
Copy link
Member Author

@walterddr the plan is to only use the parser phase of calcite. I don't see using the optimizers in calcite will provide any value because of the following

  • Most of the optimizers in calcite optimize the logical plan (suitable for complex joins/nested queries etc). We cannot leverage them since Pinot does not support joins and nested queries
  • Pinot comes up with per segment optimized physical plan based on each segment's metadata. Calcite does not support this.

@kishoreg
Copy link
Member Author

@sunithabeeram Yes, we are planning to use calcite just for parsing. I had the same opinion but I realized that it is more flexible and complete than the PQL grammar. Every feature we need to add in Pinot, we end up touching the grammar, the parser, and the execution layer. For e.g. supporting regex, expression, alias, map syntax etc. It's just a distraction and additional code to maintain.
We already tried converting sqlNode to BrokerRequest and it worked well for all the queries we have. The only exception is the options we have in Pinot and we will have to handle that separately.
Calcite is widely adopted and supporting those queries will allow us Pinot to be used easily with other UI tools.

@walterddr
Copy link
Contributor

Thanks for the explanation @kishoreg . Yeah I agree that many of the optimizations might not work for Pinot and can understand why this could be a good idea to separate the grammar changes from the other core changes.

One of advantages, as you also mentioned, is the optimizer's capability to simplify and optimize logical plan. This might open up some possibilities when writing standard SQL: for example, nested queries that are more readable, but ultimately can be compiled as a simple query.

Regarding metadata support, I wounldn't say there's no support on calcite but I think this is definitely a limitation. Similar problem has been raised in Flink's calcite metadata handler as well: https://issues.apache.org/jira/browse/FLINK-11822.

@kishoreg
Copy link
Member Author

@walterddr I looked at the Jira link. If my understanding is right, I think the metadata it refers to is global and not per segment. In the case of Pinot, the physical plan is optimized on a per segment basis based on its metadata (cardinality, indexing, star-tree, dictionary etc)

Having said that, in future if we start supporting nested queries and look up joins, we might consider using the optimizers that come with calcite.

@walterddr
Copy link
Contributor

Thanks @kishoreg, I am not familiar with the usage of per-segment data stats in Calcite and not sure if that can be easily supported.
Thank you for the prompt clarification! they make perfect sense!

@sunny19930321
Copy link
Contributor

@kishoreg @Jackie-Jiang ,I am also a pinot user at present. When will this be online? I think it's great and I'm looking forward to it

@s-sahay s-sahay added the feature label Jun 4, 2019
@siddharthteotia
Copy link
Contributor

I believe using Calcite can prove to be useful in the long run as we converge closer to standard SQL support and achieving parity with general OLAP systems. Calcite's rich grammar, parser, semantic analysis and logical planning optimization rules will all help towards Pinot supporting a wide variety of complex SQL queries, including nested queries, query rewriting, materialized view substitution etc.

I say logical planning because physical planning or execution planning is something that Pinot will have to do because the standard operator tree will then be optimized and divided into per segment execution fragment.

As part of SQL support, what do people think about supporting some flavor of distributed joins? especially broadcast join which can be very helpful for large star schema joins having a large fact table and multiple smaller dimension tables and we can broadcast the smaller dimension table to each node and do a local in-memory hash join.

@kishoreg kishoreg self-assigned this Jun 27, 2019
@kishoreg
Copy link
Member Author

We have deliberately stayed away from implementing joins within Pinot. Having said that, lookup join is definitely on the table. If you are interested in contributing, please create an issue and start the discussion.

@sunny19930321
Copy link
Contributor

sunny19930321 commented Jun 28, 2019

Can we use Calcite to support adding, subtracting, multiplying and dividing aggregates,For example: SUM(a) /1024, SUM(a) /SUM(b) /1024, (SUM(a) * 1000) /SUM(b), etc. I think these functions are very useful for connecting to open source visualized superset, grafana, redash and metabase

@s-sahay s-sahay added the In Progress This work is in progress. label Oct 18, 2019
@xiangfu0
Copy link
Contributor

Proposed the changes in execution layer in order to simplify the sql rollout.
https://docs.google.com/document/d/1uNIq0cybUtVtdtJ38-4ewFNEQorbg-2KYr-CMSj6H_8/edit?usp=sharing

Jackie-Jiang added a commit that referenced this issue Sep 30, 2020
…filter (#6056)

Add `FilterOptimizer` which supports optimizing query filter from both `BrokerRequest` and `PinotQuery`.
`FilterOptimizer` will replace `FilterQueryTreeOptimizer` which only works on `BrokerRequest` query filter.
In order to fully support SQL (#4219), the query optimizer should perform the same optimization to `PinotQuery` as `BrokerRequest`.

Add `FlattenAndOrFilterOptimizer` to replace `FlattenNestedPredicatesFilterQueryTreeOptimizer`, and removes the limitation of flatten depth
Add `MergeEqInFilterOptimizer` to replace `MultipleOrEqualitiesToInClauseFilterQueryTreeOptimizer`
Add `MergeRangeFilterOptimizer` to replace `RangeMergeOptimizer`, and supports merging range for all single-value columns (based on schema)

This PR only adds the new code. The following PR will wire the new code and remove the old code.
@kishoreg kishoreg removed the In Progress This work is in progress. label Oct 3, 2020
@kishoreg
Copy link
Member Author

kishoreg commented Oct 3, 2020

This is done and is available since 0.3.0

@kishoreg kishoreg closed this as completed Oct 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

8 participants