
Built-in SQL #3682

Merged
merged 1 commit into apache:master from gianm:sql on Dec 17, 2016

Conversation

6 participants
@gianm
Contributor

gianm commented Nov 11, 2016

This is code corresponding to the proposal at: https://groups.google.com/d/msg/druid-development/3npt9Qxpjr0/F-t--qMNBQAJ

It includes two ways of issuing SQL queries, both on the broker:

  • An HTTP POST API at /druid/v2/sql/, which accepts a JSON object containing the query (see the curl examples later in this thread).
  • A JDBC server based on Calcite's Avatica.

SQL querying is disabled by default in this PR, since enabling it causes brokers to make some extra metadata queries. I expect the extra load will be small, but I want to be conservative.

This depends on the latest Calcite (1.10.0), although some features will only work once 1.11.0 is released. These are marked with CALCITE_1_11_0 in CalciteQueryTest.

Benchmarks show minimal overhead for our benchmark queries. I expect some overhead, since SQL planning takes time and we're also going through Calcite's JDBC adapter on the output side. That could potentially be bypassed in a future patch.

Benchmark                 (rowsPerSegment)  Mode  Cnt     Score    Error  Units
SqlBenchmark.queryNative             10000  avgt   30    16.588 ±  0.283  ms/op
SqlBenchmark.queryNative            100000  avgt   30   920.364 ± 18.956  ms/op
SqlBenchmark.queryNative            200000  avgt   30  3604.093 ± 58.809  ms/op
SqlBenchmark.querySql                10000  avgt   30    35.641 ± 30.452  ms/op
SqlBenchmark.querySql               100000  avgt   30   960.524 ± 60.775  ms/op
SqlBenchmark.querySql               200000  avgt   30  3665.385 ± 48.119  ms/op

To consider for future patches:

  • Make the SQL language extensible, so we can add new functions in extensions.
  • Potentially bypass Calcite's JDBC adapter, just using it for parsing and planning.
  • Support for more Druid and SQL features.

The original PR had some questions, which are answered in the latest diff:

  • Should SQL querying be enabled by default? (Latest diff: no, it's not enabled by default)
  • Should background metadata fetching be enabled by default? See DruidSchema.java for how this is done: basically, the broker issues SegmentMetadataQueries (with a throttle) when it notices new dataSources or new segments. There's a case to be made that even with a throttle, we shouldn't enable this by default (principle of least surprise for upgrading clusters). There's also a case to be made that we should enable it by default and call out in the release notes how to turn it off (principle of ease of use for new users). (Latest diff: background metadata fetching is enabled if SQL is, but SQL is disabled by default; a sample broker config for enabling it is sketched after this list.)
  • If background metadata fetching is turned off, what happens when someone issues a SQL query? Do we just fetch it on demand? (Latest diff: background metadata fetching is always enabled if SQL is enabled)
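For reference, a minimal broker runtime.properties sketch for opting in. The druid.sql.enable flag is the one proposed later in this thread, and the planner properties come from the docs in this diff, so treat the exact keys as assumptions against this version of the patch:

# Enable the SQL endpoint on the broker (off by default in this PR)
druid.sql.enable=true

# Optional planner tuning (defaults shown, from the docs in this PR)
druid.sql.planner.metadataRefreshPeriod=PT1M
druid.sql.planner.selectPageSize=1000
druid.sql.planner.useApproximateCountDistinct=true
druid.sql.planner.useApproximateTopN=true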

@gianm gianm added the Discuss label Nov 11, 2016

@gianm gianm changed the title from Built-in SQL to [WIP] Built-in SQL Nov 11, 2016

@cheddar

Contributor

cheddar commented Nov 11, 2016

For metadata, another option would be to have the coordinator hold that information and have the brokers cache it from there. The broker could have a fallback that updates its local state based on segment metadata queries if it's having trouble loading from the coordinator (or through a DESC SQL command or something).

sql/pom.xml
@@ -0,0 +1,73 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
~ Druid - a distributed column store.

@fjy

fjy Nov 11, 2016

Contributor

wrong header

@gianm

gianm Nov 11, 2016

Contributor

Ah, I just copied this from another pom. Will change.

@fjy fjy added this to the 0.9.3 milestone Nov 11, 2016

@fjy fjy added the Feature label Nov 11, 2016

@gianm

Contributor

gianm commented Nov 12, 2016

@cheddar That means the coordinator would have to know how to issue queries to data nodes, right? I thought we wanted to avoid that.

@jihoonson

Contributor

jihoonson commented Nov 12, 2016

I tried to test it and got the following error.

$ cat test_metadata.json 
{
  "query" : "SELECT * FROM metadata.TABLES WHERE tableType = ?",
  "parameters" : ["TABLE"]
}

$ curl -XPOST -H'Content-Type: application/json' http://localhost:8082/druid/v2/sql/ -d @test_metadata.json
[{"tableCat":null,"tableSchem":"druid","tableName":"wikiticker","tableType":"TABLE","remarks":null,"typeCat":null,"typeSchem":null,"typeName":null,"selfReferencingColName":null,"refGeneration":null}]

$ curl -XPOST -H'Content-Type: application/json' http://localhost:8082/druid/v2/sql/ -d '{"query":"SELECT COUNT(*) FROM wikiticker"}'
{"error":"Unknown exception","errorMessage":"From line 1, column 22 to line 1, column 31: Table 'wikiticker' not found","errorClass":"org.apache.calcite.runtime.CalciteContextException","host":null}

Is there something I need to do before executing a SQL query?

@gianm

Contributor

gianm commented Nov 12, 2016

Try SELECT COUNT(*) FROM druid.wikiticker instead; you need the "druid." part.
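For example, rerunning the failing request from above with the schema-qualified name:

$ curl -XPOST -H'Content-Type: application/json' http://localhost:8082/druid/v2/sql/ -d '{"query":"SELECT COUNT(*) FROM druid.wikiticker"}'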

@gianm

Contributor

gianm commented Nov 12, 2016

I moved the PostgreSQL compat stuff out of this PR, since it barely works and it was clogging up the diff.

@jihoonson

Contributor

jihoonson commented Nov 13, 2016

Thanks. It works well.

@cheddar

Contributor

cheddar commented Nov 15, 2016

Hrm, yeah, I believe I can likely be quoted as saying we don't want coordinators to query. Though, at this point, I could be convinced that as long as the coordinator is not actively involved in the primary query path, that would be good enough.

My primary concern with the updating is that it generates load on the cluster for every single broker. While it's arguable whether the load is significant, the fact that it grows linearly with the number of brokers is what bothers me; the suggestion to consolidate on the coordinator was an attempt at making it a constant amount of load. If you can think of another way to make it constant, I'd also be down with that.

@gianm

Contributor

gianm commented Dec 5, 2016

@cheddar I can think of some ways to reduce the linearly increasing load, like caching metadata more aggressively on historicals, but can't think of a good way to do a truly constant load other than moving this to the coordinator.

In the meantime I can add a config "druid.sql.enable" that you can use to turn all this off if you don't want it, which will mean that people that don't need SQL won't get the additional metadata load. Would that work for you?

And even though it's linear, I don't expect the load of this metadata querying to be too large, since the SegmentMetadataQuery results should mostly be cached on historicals anyway.

@gianm gianm removed the Discuss label Dec 5, 2016

@gianm gianm changed the title from [WIP] Built-in SQL to Built-in SQL Dec 5, 2016

@gianm

Contributor

gianm commented Dec 5, 2016

Removed the "WIP" tag since I believe this is ready for review. I updated the top comment with the current state of the patch, please refer to that for context.

@cheddar

Contributor

cheddar commented Dec 9, 2016

Given that this is pretty experimental, I'm fine with this coming in. I just double-checked that it's opt-in, and it looks like it's primarily just new classes with an entry point of a Jersey resource.

That got me wondering if it should be introduced as an extension module?

Either way, I'm 👍 with it being included.

|`druid.sql.planner.metadataRefreshPeriod`|Throttle for metadata refreshes.|PT1M|
|`druid.sql.planner.selectPageSize`|Page size threshold for [Select queries](../querying/select-query.html). Select queries for larger resultsets will be issued back-to-back using pagination.|1000|
|`druid.sql.planner.useApproximateCountDistinct`|Whether to use an approximate cardinality algorithm for `COUNT(DISTINCT foo)`.|true|
|`druid.sql.planner.useApproximateTopN`|Whether to use approximate [TopN queries](../querying/topnquery.html) when a SQL query could be expressed as such. If false, exact [GroupBy queries](../querying/groupbyquery.html) will be used instead.|true|

@b-slim

b-slim Dec 10, 2016

Contributor

Can't we switch this at query time? Maybe by adding some runtime config endpoints?

@gianm

gianm Dec 13, 2016

Contributor

@b-slim that seems useful, although right now there is no mechanism for adjusting planner configs at runtime. It could probably be added in a future PR.
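As an illustration of what the last two planner flags control, here are two sketches against the wikiticker table used earlier in this thread (the "channel" column is an assumption based on the wikiticker example data):

-- Planned with an approximate cardinality aggregation when
-- useApproximateCountDistinct=true; exact when false.
SELECT COUNT(DISTINCT "channel") FROM druid.wikiticker

-- Planned as an approximate TopN query when useApproximateTopN=true;
-- as an exact GroupBy query when false.
SELECT "channel", COUNT(*) AS cnt FROM druid.wikiticker GROUP BY "channel" ORDER BY cnt DESC LIMIT 10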

- [Query-time lookups](lookups.html).
- [Nested groupBy queries](groupbyquery.html#nested-groupbys).
- Extensions, including [approximate histograms](../development/extensions-core/approximate-histograms.html) and
[DataSketches](../development/extensions-core/datasketches-aggregators.html).

@b-slim

b-slim Dec 10, 2016

Contributor

If we support HLL, I'm not sure why DataSketches can't be supported?

@gianm

gianm Dec 13, 2016

Contributor

@b-slim just because HLL is in core. I haven't added an extension point yet (I was hoping to do that in a future PR), so only features in core Druid are usable right now.

@fjy

Contributor

fjy commented Dec 12, 2016

👍 Given that it has minimal impact and has been tested

@fjy fjy assigned cheddar and fjy Dec 13, 2016

@gianm

Contributor

gianm commented Dec 13, 2016

@cheddar, re:

That got me wondering if it should be introduced as an extension module?

The idea behind making it a core module is that I thought that at some point we would want to add an extension point for SQL UDFs to datasketches, approximate histograms, etc. To make that work, either the UDFs would need to be in the various extensions and they would need to require druid-sql, or the UDFs would need to be in druid-sql and it would need to require the various extensions.

The former made more sense to me, and given that it implies druid-sql will one day be a dependency of a variety of extensions, it seemed like it belongs in core. Otherwise we get into extensions requiring other extensions that seem only lightly related, which would be confusing to users.

If you have a better idea about how to handle extensions / UDFs with SQL then I am all ears.

@gianm

Contributor

gianm commented Dec 16, 2016

Fixed conflicts with master.

@fjy fjy merged commit dd63f54 into apache:master Dec 17, 2016

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

@gianm gianm deleted the gianm:sql branch Dec 19, 2016

@praveev

Contributor

praveev commented Dec 23, 2016

@gianm I believe there is a conflict between jackson-databind 2.6.3 from avatica.jar (which comes from calcite-core) and Druid's explicit jackson-databind 2.4.6 dependency. I see issues during batch ingestion; realtime using Tranquility is fine. Here is a gist containing the stack trace.

@gianm

Contributor

gianm commented Dec 23, 2016

@praveev could you raise a separate issue for this please? Does adding exclusions to the pom help?
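For anyone trying that workaround, the kind of exclusion being suggested would look roughly like this (coordinates assumed from the calcite-core 1.10.0 dependency mentioned earlier; note that the follow-up below reports it did not help):

<dependency>
  <groupId>org.apache.calcite</groupId>
  <artifactId>calcite-core</artifactId>
  <version>1.10.0</version>
  <exclusions>
    <!-- attempt to keep Druid's jackson-databind 2.4.6 on the classpath -->
    <exclusion>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-databind</artifactId>
    </exclusion>
  </exclusions>
</dependency>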

@praveev

Contributor

praveev commented Dec 23, 2016

@gianm I've created #3800 to track this. Exclusions didn't help.

dgolitsyn added a commit to metamx/druid that referenced this pull request Feb 14, 2017

@gianm gianm added the Release Notes label Feb 23, 2017

@gianm gianm referenced this pull request Feb 28, 2017

Closed

Druid 0.10.0 release notes #3944

@jihoonson jihoonson referenced this pull request Jul 12, 2017

Merged

Protobuf extension #4039
