Refactor dataframe and column name mutation logic #6847

villebro · 2019-02-09T14:46:04Z

Several small refactors, features and fixes related to SQLAlchemy engines:

Add flag to disable aliases in db_engine_specs. By default engines support aliases; this is only disabled on Pinot for now. If the engine doesn't support aliases, labels aren't added to columns (see screenshot below). This is done by replacing the self.get_label() function with a self.make_sqla_column_compatible(), that returns a label if those are supported and the original column if not.
Merge old column name mutation logic with the new one via query_str_ext.labels_expected.
Move dataframe logic back to connectors/sqla/models.py from db_engine_specs.py and merge it with the column renaming logic from before.
Add max_column_name_length property to db_engine_specs, and default to MD5 hash if column name is exceeded. Add max lengths to engines that were easy to find by googling.
Fix a few edge cases where subqueries with double underscored aliases would not render properly if column length restriction was exceeded.
Fix a few linting errors in db_engine_specs.

Example

When creating a chart on Pinot where COUNT(*) is calculated grouping by playerID, the result shows up like before:

Looking at the query, the alias is gone, making it regular PQL:

FYI @agrawaldevesh please check this and give comments when you have the time. I tested this with your fork of pinot_dbapi (I assume that's the one you are using for now).

codecov-io · 2019-02-09T15:17:04Z

Codecov Report

Merging #6847 into master will increase coverage by 0.09%.
The diff coverage is 94.11%.

@@            Coverage Diff             @@
##           master    #6847      +/-   ##
==========================================
+ Coverage   63.83%   63.93%   +0.09%     
==========================================
  Files         421      421              
  Lines       20447    20458      +11     
  Branches     2218     2218              
==========================================
+ Hits        13053    13079      +26     
+ Misses       7262     7247      -15     
  Partials      132      132

Impacted Files	Coverage Δ
superset/connectors/sqla/models.py	`81.7% <100%> (+0.18%)`	⬆️
superset/db_engine_specs.py	`56.3% <88.88%> (+2.48%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 30cd0e3...92c3656. Read the comment docs.

superset/connectors/sqla/models.py

agrawaldevesh · 2019-02-10T04:25:20Z

superset/db_engine_specs.py

@@ -111,8 +111,10 @@ class BaseEngineSpec(object):
    time_secondary_columns = False
    inner_joins = True
    allows_subquery = True
+    supports_column_aliases = True


It would be good to add a comment on what this means, or just for our own understanding: It means not only that the DB supports the "AS" keyword but also it actually returns the columns with the specified alias name ... ie alias is supported both syntactically and semantically.

For example in Pinot, it does support the "AS" keyword syntactically but its not actually returned with it.

Interestingly as @betodealmeida points out, this is also violated at times by postgres: it returns the returned column name as 'sum(foo)' instead of 'SUM(foo)' sometimes.

So should we say that even for PG supports_column_aliases should be false ?

I think it's good to limit the use of supports_column_aliases to engines that truly don't support aliases. Even though the new mutation logic makes the aliases less important, they still serve two purposes:

Rendering a more descriptive query in the View Query tab that can be copied and used elsewhere.

Making the subqueries in more complex queries work. Several engines require forced quotes on aliases to resolve in the main query, and in some cases a column from a subquery can exceed the max column name length, in which case it needs to be mutated.

It is my understanding that Postgres requires quotes around case-sensitive alises. Eg. select 1 as x_Col will return the header x_col, while select 1 as "x_Col" will return x_Col. If I remember correctly psycopg2 does this quoting automatically, but if not, it might make sense to set force_column_alias_quotes to True. I can test this more thoroughly.

I tested this on Postgres and wasn't able to reproduce case sensitive cols not being automatically quoted. See below how lowercol(=lowercase) is not quoted but mixedCol(=mixed case) is automatically quoted in the main query.

superset/connectors/sqla/models.py

agrawaldevesh · 2019-02-10T04:31:41Z

LGTM. My main question is wrt to should we always set labels_expected and stamp that back in ? I am curious if that would break something ? Perhaps something in the explore flow might break ?

Thanks for cleaning this !

superset/connectors/sqla/models.py

agrawaldevesh · 2019-02-10T04:49:28Z

I am wondering if there are any semantic changes wrt to how the queries are created for each DB ?

Longer term, this stuff is still fragile and not well unit tested. We should consider adding unit tests for this that actually assert the query shapes (perhaps by parsing out the generated queries using the sqlparser/moz-parser). The current test infra, surprisingly does not really test this: It merely runs the tests for three dummy test environments: No customer is probably going to use postgres, mysql or sqlite to do actual data warehouse visualizations.

villebro · 2019-02-10T18:28:44Z

@agrawaldevesh agreed, it would be neat to have an extras_require section for every officially supported engine in setup.py, with some minimal semantic validation for each. Also, if commercial DWs like BigQuery, Snowflake, Athena, Redshift would like to offer free testing servers, more comprehensive integration tests could be added for those.

agrawaldevesh · 2019-02-10T18:38:25Z

Looks great now. I don't have any more comments :-). I will test this patch on my Pinot installation today too.

agrawaldevesh · 2019-02-10T19:13:57Z

I can confirm that the patch works on my Pinot installation. So I think it's good to go.

villebro · 2019-02-10T19:17:01Z

Thanks, I'll give this one more spin on my test rig before submitting for committer review.

villebro · 2019-02-11T20:20:26Z

@betodealmeida @mistercrunch This is now ready for review.

…g column aliases and add column name length checking

villebro · 2019-02-13T07:44:45Z

FYI, I stumbled on some unrelated errors when testing this, will work through those first before removing WIP.

villebro · 2019-02-15T16:21:17Z

@agrawaldevesh @betodealmeida @mistercrunch This is (again) ready for review, sorry for prematurely flagging this as done earlier. No big changes since @agrawaldevesh 's initial review, just some small refinements.

mistercrunch · 2019-02-17T22:26:14Z

LGTM

agrawaldevesh · 2019-02-19T07:40:16Z

LGTM to me too :-)

agrawaldevesh · 2019-02-20T17:50:27Z

@mistercrunch @betodealmeida , I am wondering when would this PR be merged in. As soon as this one goes in, I want to #6831 on top and resubmit that.

agrawaldevesh reviewed Feb 10, 2019

View reviewed changes

superset/connectors/sqla/models.py Outdated Show resolved Hide resolved

agrawaldevesh reviewed Feb 10, 2019

View reviewed changes

superset/connectors/sqla/models.py Outdated Show resolved Hide resolved

agrawaldevesh reviewed Feb 10, 2019

View reviewed changes

superset/connectors/sqla/models.py Outdated Show resolved Hide resolved

agrawaldevesh reviewed Feb 10, 2019

View reviewed changes

superset/connectors/sqla/models.py Show resolved Hide resolved

kristw added change:backend Requires changing the backend #code-quality labels Feb 10, 2019

villebro changed the title ~~[WIP] Refactor dataframe and column name mutation logic~~ Refactor dataframe and column name mutation logic Feb 11, 2019

villebro changed the title ~~Refactor dataframe and column name mutation logic~~ [WIP] Refactor dataframe and column name mutation logic Feb 12, 2019

villebro added 9 commits February 12, 2019 19:36

Merge dataframe and column name mutation logic, add flag for disablin…

ba90ecf

…g column aliases and add column name length checking

Remove custome mutate_label from oracle spec

171c3db

Move hashing from mutate_label() to make_label_compatible()

aa10230

Remove empty line

21a969b

Make label mutating and truncating more robust

ac4f3d9

Rename variables and make proposed changes from review

32fe0c0

Always execute labels_expected codepath

faccc81

Fix linting error

a32f61e

Add comments and fix subquery errors

ced1c63

Refine column compatibility

e3b289d

villebro changed the title ~~[WIP] Refactor dataframe and column name mutation logic~~ Refactor dataframe and column name mutation logic Feb 15, 2019

Simplify label assignment

43e6bf0

Add unit tests for BQ and Oracle

814655a

Linting

92c3656

mistercrunch merged commit f5277fe into apache:master Feb 21, 2019

villebro deleted the refactor_label_mutator branch February 24, 2019 19:16

mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.34.0 labels Feb 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor dataframe and column name mutation logic #6847

Refactor dataframe and column name mutation logic #6847

villebro commented Feb 9, 2019 •

edited

codecov-io commented Feb 9, 2019 •

edited

agrawaldevesh Feb 10, 2019

villebro Feb 10, 2019 •

edited

villebro Feb 11, 2019

agrawaldevesh commented Feb 10, 2019

agrawaldevesh commented Feb 10, 2019

villebro commented Feb 10, 2019

agrawaldevesh commented Feb 10, 2019

agrawaldevesh commented Feb 10, 2019

villebro commented Feb 10, 2019

villebro commented Feb 11, 2019

villebro commented Feb 13, 2019

villebro commented Feb 15, 2019

mistercrunch commented Feb 17, 2019

agrawaldevesh commented Feb 19, 2019

agrawaldevesh commented Feb 20, 2019

Refactor dataframe and column name mutation logic #6847

Refactor dataframe and column name mutation logic #6847

Conversation

villebro commented Feb 9, 2019 • edited

Example

codecov-io commented Feb 9, 2019 • edited

Codecov Report

agrawaldevesh Feb 10, 2019

Choose a reason for hiding this comment

villebro Feb 10, 2019 • edited

Choose a reason for hiding this comment

villebro Feb 11, 2019

Choose a reason for hiding this comment

agrawaldevesh commented Feb 10, 2019

agrawaldevesh commented Feb 10, 2019

villebro commented Feb 10, 2019

agrawaldevesh commented Feb 10, 2019

agrawaldevesh commented Feb 10, 2019

villebro commented Feb 10, 2019

villebro commented Feb 11, 2019

villebro commented Feb 13, 2019

villebro commented Feb 15, 2019

mistercrunch commented Feb 17, 2019

agrawaldevesh commented Feb 19, 2019

agrawaldevesh commented Feb 20, 2019

villebro commented Feb 9, 2019 •

edited

codecov-io commented Feb 9, 2019 •

edited

villebro Feb 10, 2019 •

edited