
Built-in SQL #3682

Merged · 1 commit · Dec 17, 2016

Conversation

@gianm (Contributor) commented Nov 11, 2016

This is code corresponding to the proposal at: https://groups.google.com/d/msg/druid-development/3npt9Qxpjr0/F-t--qMNBQAJ

It includes:

There are two ways of issuing SQL queries, both on the broker:

SQL querying is disabled by default in this PR, since enabling it causes brokers to make some extra metadata queries. I expect the extra load will be small, but I want to be conservative.
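As a minimal sketch of opting in, assuming the "druid.sql.enable" flag discussed later in this thread, a broker's runtime.properties would gain a line like:

```properties
# Opt in to SQL querying (disabled by default in this PR).
# Enabling it also turns on background segment-metadata fetching.
druid.sql.enable=true
```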

This depends on the latest Calcite (1.10.0) although there are some features that will only work when 1.11.0 is released. These are marked with CALCITE_1_11_0 in CalciteQueryTest.

Benchmarks show minimal overhead for our benchmark queries. I expect some overhead, since SQL planning takes time and we also go through Calcite's JDBC adapter on the output side. That could potentially be bypassed in a future patch.

Benchmark                 (rowsPerSegment)  Mode  Cnt     Score    Error  Units
SqlBenchmark.queryNative             10000  avgt   30    16.588 ±  0.283  ms/op
SqlBenchmark.queryNative            100000  avgt   30   920.364 ± 18.956  ms/op
SqlBenchmark.queryNative            200000  avgt   30  3604.093 ± 58.809  ms/op
SqlBenchmark.querySql                10000  avgt   30    35.641 ± 30.452  ms/op
SqlBenchmark.querySql               100000  avgt   30   960.524 ± 60.775  ms/op
SqlBenchmark.querySql               200000  avgt   30  3665.385 ± 48.119  ms/op

To consider for future patches:

  • Make the SQL language extendable, so we can add new functions in extensions.
  • Potentially bypass Calcite's JDBC adapter, just using it for parsing and planning.
  • Support for more Druid and SQL features.

The original PR had some questions, which are answered in the latest diff:

  • Should SQL querying be enabled by default? (Latest diff: no, it's not enabled by default)
  • Should background metadata fetching be enabled by default? See DruidSchema.java for how this is done. Basically the broker will issue SegmentMetadataQueries (with a throttle) when it notices new dataSources or new segments. There's a case to be made that even with a throttle, we shouldn't enable this by default (principle of least surprise for upgrading clusters). There's also a case to be made that we should enable it by default, and call out in the release notes how to turn it off (principle of ease of use for new users). (Latest diff: background metadata fetching is enabled if SQL is, but SQL is disabled by default)
  • If background metadata fetching is turned off, what happens when someone issues a SQL query? Do we just fetch it on demand? (Latest diff: background metadata fetching is always enabled if SQL is enabled)

@gianm gianm added the Discuss label Nov 11, 2016
@gianm gianm changed the title Built-in SQL [WIP] Built-in SQL Nov 11, 2016
@cheddar (Contributor) commented Nov 11, 2016

For metadata, another option would be for the coordinator to hold that information and for brokers to cache it from there. It could have a fallback that causes it to update its local state based on segment metadata queries if it's having trouble loading from the coordinator (or through a desc SQL command or something).

@@ -0,0 +1,73 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
~ Druid - a distributed column store.
A reviewer (Contributor) commented:

wrong header

@gianm (author) replied:

Ah, I just copied this from another pom. Will change.

@fjy fjy added this to the 0.9.3 milestone Nov 11, 2016
@fjy fjy added the Feature label Nov 11, 2016
@gianm (Contributor, Author) commented Nov 12, 2016

@cheddar That means the coordinator would have to know how to issue queries to data nodes, right? I thought we wanted to avoid that.

@jihoonson (Contributor) commented:

I tried to test this and got the following error.

$ cat test_metadata.json 
{
  "query" : "SELECT * FROM metadata.TABLES WHERE tableType = ?",
  "parameters" : ["TABLE"]
}

$ curl -XPOST -H'Content-Type: application/json' http://localhost:8082/druid/v2/sql/ -d @test_metadata.json
[{"tableCat":null,"tableSchem":"druid","tableName":"wikiticker","tableType":"TABLE","remarks":null,"typeCat":null,"typeSchem":null,"typeName":null,"selfReferencingColName":null,"refGeneration":null}]

$ curl -XPOST -H'Content-Type: application/json' http://localhost:8082/druid/v2/sql/ -d '{"query":"SELECT COUNT(*) FROM wikiticker"}'
{"error":"Unknown exception","errorMessage":"From line 1, column 22 to line 1, column 31: Table 'wikiticker' not found","errorClass":"org.apache.calcite.runtime.CalciteContextException","host":null}

Is there something I need to do before executing a SQL query?

@gianm (Contributor, Author) commented Nov 12, 2016

Try SELECT COUNT(*) FROM druid.wikiticker instead; you need the "druid." prefix.
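Putting this fix together with jihoonson's example, the corrected request would look something like the following sketch (the file name is hypothetical; the endpoint and port are copied from the example above):

```shell
# The table must be qualified with the "druid" schema.
cat > count_query.json <<'EOF'
{"query":"SELECT COUNT(*) FROM druid.wikiticker"}
EOF

# POST it to the broker as before:
# curl -XPOST -H'Content-Type: application/json' \
#   http://localhost:8082/druid/v2/sql/ -d @count_query.json
```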

@gianm (Contributor, Author) commented Nov 12, 2016

I moved the PostgreSQL compat stuff out of this PR, since it barely works and it was clogging up the diff.

@jihoonson (Contributor) replied:

Thanks. It works well.

@cheddar (Contributor) commented Nov 15, 2016

Hrm, yeah, I believe I can likely be quoted as saying we don't want coordinators to query. Though, at this point, I could be convinced that as long as the coordinator is not actively involved in the primary query path, that would be good enough.

My primary concern with the updating is that every single broker will generate load on the cluster. While it's arguable whether that load is significant, the fact that it's a linear amount of load is what bothers me; the suggestion to consolidate on the coordinator was an attempt to make it a constant amount of load. If you can think of another way to make it constant, I'd also be down with that.

@gianm (Contributor, Author) commented Dec 5, 2016

@cheddar I can think of some ways to reduce the linearly increasing load, like caching metadata more aggressively on historicals, but can't think of a good way to do a truly constant load other than moving this to the coordinator.

In the meantime I can add a config "druid.sql.enable" that you can use to turn all this off if you don't want it, which will mean that people that don't need SQL won't get the additional metadata load. Would that work for you?

And even though it's linear, I don't expect the load of this metadata querying to be too large, since the SegmentMetadataQuery results should mostly be cached on historicals anyway.

@gianm gianm removed the Discuss label Dec 5, 2016
@gianm gianm changed the title [WIP] Built-in SQL Built-in SQL Dec 5, 2016
@gianm (Contributor, Author) commented Dec 5, 2016

Removed the "WIP" tag since I believe this is ready for review. I updated the top comment with the current state of the patch, please refer to that for context.

@cheddar (Contributor) commented Dec 9, 2016

Given that this is pretty experimental, I'm fine with this coming in. I just double-checked whether it is opt-in, and it looks like it's primarily just new classes with an entry point of a Jersey resource.

That got me wondering whether it should be introduced as an extension module.

Either way, I'm 👍 with it being included.

|Property|Description|Default|
|--------|-----------|-------|
|`druid.sql.planner.metadataRefreshPeriod`|Throttle for metadata refreshes.|PT1M|
|`druid.sql.planner.selectPageSize`|Page size threshold for [Select queries](../querying/select-query.html). Select queries for larger resultsets will be issued back-to-back using pagination.|1000|
|`druid.sql.planner.useApproximateCountDistinct`|Whether to use an approximate cardinality algorithm for `COUNT(DISTINCT foo)`.|true|
|`druid.sql.planner.useApproximateTopN`|Whether to use approximate [TopN queries](../querying/topnquery.html) when a SQL query could be expressed as such. If false, exact [GroupBy queries](../querying/groupbyquery.html) will be used instead.|true|
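For reference, a broker configuration touching these settings might look like this sketch, using only the properties and documented defaults from the table above:

```properties
# Sketch of broker runtime.properties; values shown are the defaults.
druid.sql.planner.metadataRefreshPeriod=PT1M
druid.sql.planner.selectPageSize=1000
druid.sql.planner.useApproximateCountDistinct=true
druid.sql.planner.useApproximateTopN=true
```

Per the table, setting `druid.sql.planner.useApproximateTopN=false` would force exact GroupBy-based plans instead of approximate TopN queries.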
@b-slim (Contributor) commented:

can't we switch this at query time ? maybe adding some runtime config endpoints ?

@gianm (author) replied:

@b-slim that seems useful although right now there is no mechanism for adjusting planner configs at runtime. It could probably be added in a future PR.

- [Query-time lookups](lookups.html).
- [Nested groupBy queries](groupbyquery.html#nested-groupbys).
- Extensions, including [approximate histograms](../development/extensions-core/approximate-histograms.html) and
[DataSketches](../development/extensions-core/datasketches-aggregators.html).
@b-slim (Contributor) commented:

if we support HLL not sure why DataSketches can not be supported ?

@gianm (author) replied:

@b-slim just because HLL is in core. I haven't added an extension point yet (was hoping to do that in a future PR) and so only features in core Druid are usable right now.

@fjy (Contributor) commented Dec 12, 2016

👍 Given that it has minimal impact and has been tested.

@fjy fjy assigned cheddar and fjy Dec 13, 2016
@gianm (Contributor, Author) commented Dec 13, 2016

@cheddar, re:

That got me wondering if it should be introduced as an extension module?

The idea behind making it a core module is that I thought at some point we would want to add an extension point for SQL UDFs for things like datasketches and approximate histograms. To make that work, either the UDFs would need to live in the various extensions, which would then need to require druid-sql, or the UDFs would need to live in druid-sql, which would then need to require the various extensions.

The former made more sense to me, and given that it implies druid-sql will one day be a dependency of a variety of extensions, it seemed like it belongs in core. Otherwise we get into extensions requiring other, only lightly related extensions, which would be confusing to users.

If you have a better idea about how to handle extensions / UDFs with SQL then I am all ears.

@gianm (Contributor, Author) commented Dec 16, 2016

Fixed conflicts with master.

@fjy fjy merged commit dd63f54 into apache:master Dec 17, 2016
@gianm gianm deleted the sql branch December 19, 2016 18:18
@praveev (Contributor) commented Dec 23, 2016

@gianm I believe there is a conflict with jackson-databind 2.6.3 from avatica.jar (comes from calcite-core) and druid's explicit jackson-databind 2.4.6 dependency. I see issues during batch ingestion. Realtime using tranquility is fine. Here is a gist containing the stack trace

@gianm (Contributor, Author) commented Dec 23, 2016

@praveev could you raise a separate issue for this please? Does adding exclusions to the pom help?

@praveev (Contributor) commented Dec 23, 2016

@gianm I've created #3800 to track this. Exclusions didn't help.

dgolitsyn pushed a commit to metamx/druid that referenced this pull request Feb 14, 2017
@jihoonson jihoonson mentioned this pull request Jul 12, 2017
@yurmix yurmix mentioned this pull request May 29, 2019
seoeun25 pushed a commit to seoeun25/incubator-druid that referenced this pull request Jan 10, 2020
6 participants