Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SIP-11] Proposal for deprecating the native Druid NoSQL connector #6032

Closed
john-bodley opened this issue Oct 4, 2018 · 17 comments
Closed
Labels
.pinned Draws attention sip Superset Improvement Proposal
Projects

Comments

@john-bodley
Copy link
Member

Motivation

Superset currently supports two engine connectors for querying datasources; SQLAlchemy and the Druid REST API. The later was the initial use case for Superset, i.e., a UI for visualizing Druid datasources.

Since version 0.10.0 Druid has included a built-it SQL server which has a SQLAlchemy binding provided by the pydruid library (courtesy of @betodealmeida and @mistercrunch) and thus the proposed change is to deprecate the REST API interface in favor of having a single interface (SQLAlchemy) to all engines. Note all future engines (there has been mentioned of adding support for Elasticsearch) would require a SQLAlchemy dialect.

There is a non-insignificant amount of overhead in supporting both connectors including:

Code

From a code perspective each connector needs to define similar views and models. The Druid connector alone comprises of around 2,000 lines of code. There is additional frontend logic which needs to construct filters, metrics, etc. for both the Druid REST API and SQLAlchemy. Note there are 74 files (including documentation) which reference Druid in the repo.

Models

In addition to code overhead each connector defines its own models and database tables:

Druid:

  • clusters
  • datasources
  • columns
  • metrics

SQLAlchemy:

  • dbs
  • tables
  • table_columns
  • sql_metric

which complicates logic, i.e., the slices table does not have a SQLAlchemy relationship to a "datasource" table as the datasource type determines the association. This results in denormalized tables with potentially incorrect values, i.e., the slices table contains the datasource_name column for the FAB CRUD views, however this may not accurately reflect the underlying datasource name.

Proposed Change

The proposed change would be to deprecate all the Druid REST logic from the codebase. This significantly simplifies and streamlines a number of facets of Superset by ensuring that all engines connect via a SQLAlchemy dialect.

Currently there is support for syncing/refreshing Druid datasource associated with the REST API connector which I suspect is leveraged by a number of organizations. SIP-7 discussing "refreshing" of Superset datasources.

Note this would be a breaking change for any organizing using a Druid version less than 0.10.0. Also there may be some instances of post-aggregate Druid functions which are not supported in Druid SQL.

New or Changed Public Interfaces

There would be no new or changed public interfaces.

New dependencies

There would be no new dependencies.

Migration Plan and Compatibility

A non-trivial database migration would be required including:

  • All records in the Druid tables listed above would need to be migrated to the SQLAlchemy equivalent table.
  • Existing slices would need to be updated to reference the new SQLAlchemy representation of the Druid datasource.
  • Re-normalize the slices table.
  • Update chart data to remove the obsolete table__ or druid__ prefixes.

Rejected Alternatives

None.

to: @betodealmeida @graceguo-supercat @kristw @michellethomas @mistercrunch @timifasubaa

@kristw
Copy link
Contributor

kristw commented Oct 4, 2018

I am in favor of consolidating and standardizing in general. 👍
This should reduce bugs and improve maintainability.

@mistercrunch
Copy link
Member

That will effectively mean deprecating support for pre-good-SQL-support druid versions, which I think is fine. If people want to use old Druid, they can use old Superset with it :)

@john-bodley
Copy link
Member Author

@mistercrunch are you aware of any functionality that the REST API provides (such as phase II queries) that couldn’t be captured in Druid SQL?

@john-bodley john-bodley added the sip Superset Improvement Proposal label Oct 4, 2018
@kkalyan
Copy link
Contributor

kkalyan commented Oct 8, 2018

How are Druid dimension extractions/lookup, filtered aggregations and javascript post-aggregators with Druid SQL? Druid users use them heavily. Some of these are not native to SQL, it would be good to support them.

@datametrics
Copy link

I use the REST API of Druid heavily. It makes it totally easy to make discovery of datasources as well as to implement Druid clients from other languages. The rest format is platform interchangeable and there is no need to implement any further sql parser / converter logic. One can just throw some model classes together and serialise that to Json. From my point of view it would be great when superset continues with REST support or at least leaves the opportunity for connector injection.

@SpyderRivera
Copy link
Contributor

Druid SQL won't support core features like multi-value dimensions until 0.13.1

@akuckartz
Copy link

Is SQLAlchemy a REST API ?

@stale
Copy link

stale bot commented Apr 10, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the inactive Inactive for >= 30 days label Apr 10, 2019
@kristw kristw removed the inactive Inactive for >= 30 days label Apr 10, 2019
@mistercrunch mistercrunch added the .pinned Draws attention label Nov 15, 2019
@john-bodley
Copy link
Member Author

john-bodley commented Dec 11, 2019

#8512 introduced a feature flag which enables the Druid NoSQL connector. Note by default this is disabled.

@mistercrunch mistercrunch added this to pre-DISCUSS in SIPs Feb 24, 2020
@AnkitaSanyogita
Copy link

Superset is used for visualization and exploration. While SQL type data can be sliced and these slices can be secured, is it possible to do so in case of druid datasource. User may only be privileged to see data for a certain dimension from druid datasource. We could use Rest API connector to pass some header information and secure information. On one hand druid promotes de-normalization, hence adding all dimensions in flattened datasource, but those dimensions are hooks and handles of user data rights. I am not sure if this use case is possible through Druid SQL. I would like to see this feature being there and strengthened in Superset.

@dharamgajera
Copy link

dharamgajera commented Jul 28, 2020

A lot of people use Superset just because it has native support for Druid's functionalities like calculated columns, aggregations, post-aggregations, etc. And is the only reason to use Superset for companies like ours.
Druid SQL lacks in functionality compared to REST API and is the reason why Superset is preferred over competitors like Metabase, Redash, etc.

@mistercrunch
Copy link
Member

@dharamgajera what does Druid SQL lack at this time?

@dharamgajera
Copy link

dharamgajera commented Jul 29, 2020

@mistercrunch Official Druid Docs - https://druid.apache.org/docs/latest/querying/sql.html#unsupported-features
Apart from this, Writing Druid SQL queries in Superset gets really complex and time consuming.

  1. One of the most important, applying Dashboard level Time Filter on SQL Query chart.
  2. If there is any change in a calculated column/metric, each and every SQL query needs to be changed.

These are just based on my usage of Superset so far. I have used and evaluated Metabase and Redash as well. But ease and functionalities (Calculated Columns) in Superset made me reject others.

UPDATE :
My Bad ! I overlooked calculated column/metric feature in SQL. The same is available in SQL as well.

@mistercrunch
Copy link
Member

fLAMxqKl_400x400

@wchaofan
Copy link

wchaofan commented Jun 9, 2021

hi @john-bodley , How do I connect to my druid cluster using the superset of 1.1.0?
I set DRUID_IS_ACTIVE=True ,but this was not work.The navigation still has no Druid Cluster Menu item。

@john-bodley john-bodley changed the title [SIP-11] Proposal for deprecating Druid REST connector [SIP-11] Proposal for deprecating the native Druid NoSQL connector Jun 23, 2021
@john-bodley
Copy link
Member Author

@wchaofan apologies I just saw your comment. The DRUID_IS_ACTIVE flag is somewhat misleading and relates to whether the native Druid NoSQL connector is exposed. Per the SIP given that the connector is slated for deprecation I would strongly recommend that you register your Druid cluster as a SQL database.

@john-bodley
Copy link
Member Author

For reference Airbnb is slated to have migrated all of their charts using the native Druid NoSQL connector to Druid SQL by mid August.

I’m working with @villebro to share out the internal work (documentation, migration logic, scripts, etc.) we’ve undertaken which could be generalized for other installations hoping to migrate.

Note there are some non-compatible differences and thus not every chart can be migrated. In terms of ad-hoc filters Druid 0.19.0 introduced a REGEX_LIKE SQL UDF which ensures parity with the regex comparator for the Druid NoSQL connector.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
.pinned Draws attention sip Superset Improvement Proposal
Projects
Status: Implemented / Done
SIPs
IMPLEMENTED / DONE
Development

No branches or pull requests

10 participants