Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor dataframe and column name mutation logic #6847

Merged
merged 13 commits into from Feb 21, 2019
Merged

Refactor dataframe and column name mutation logic #6847

merged 13 commits into from Feb 21, 2019

Conversation

villebro
Copy link
Member

@villebro villebro commented Feb 9, 2019

Several small refactors, features and fixes related to SQLAlchemy engines:

  • Add flag to disable aliases in db_engine_specs. By default engines support aliases; this is only disabled on Pinot for now. If the engine doesn't support aliases, labels aren't added to columns (see screenshot below). This is done by replacing the self.get_label() function with a self.make_sqla_column_compatible(), that returns a label if those are supported and the original column if not.
  • Merge old column name mutation logic with the new one via query_str_ext.labels_expected.
  • Move dataframe logic back to connectors/sqla/models.py from db_engine_specs.py and merge it with the column renaming logic from before.
  • Add max_column_name_length property to db_engine_specs, and default to MD5 hash if column name is exceeded. Add max lengths to engines that were easy to find by googling.
  • Fix a few edge cases where subqueries with double underscored aliases would not render properly if column length restriction was exceeded.
  • Fix a few linting errors in db_engine_specs.

Example

When creating a chart on Pinot where COUNT(*) is calculated grouping by playerID, the result shows up like before:

screenshot 2019-02-09 at 16 09 37

Looking at the query, the alias is gone, making it regular PQL:

screenshot 2019-02-09 at 16 09 25

FYI @agrawaldevesh please check this and give comments when you have the time. I tested this with your fork of pinot_dbapi (I assume that's the one you are using for now).

@codecov-io
Copy link

codecov-io commented Feb 9, 2019

Codecov Report

Merging #6847 into master will increase coverage by 0.09%.
The diff coverage is 94.11%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6847      +/-   ##
==========================================
+ Coverage   63.83%   63.93%   +0.09%     
==========================================
  Files         421      421              
  Lines       20447    20458      +11     
  Branches     2218     2218              
==========================================
+ Hits        13053    13079      +26     
+ Misses       7262     7247      -15     
  Partials      132      132
Impacted Files Coverage Δ
superset/connectors/sqla/models.py 81.7% <100%> (+0.18%) ⬆️
superset/db_engine_specs.py 56.3% <88.88%> (+2.48%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 30cd0e3...92c3656. Read the comment docs.

@@ -111,8 +111,10 @@ class BaseEngineSpec(object):
time_secondary_columns = False
inner_joins = True
allows_subquery = True
supports_column_aliases = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add a comment on what this means, or just for our own understanding: It means not only that the DB supports the "AS" keyword but also it actually returns the columns with the specified alias name ... ie alias is supported both syntactically and semantically.

For example in Pinot, it does support the "AS" keyword syntactically but its not actually returned with it.

Interestingly as @betodealmeida points out, this is also violated at times by postgres: it returns the returned column name as 'sum(foo)' instead of 'SUM(foo)' sometimes.

So should we say that even for PG supports_column_aliases should be false ?

Copy link
Member Author

@villebro villebro Feb 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's good to limit the use of supports_column_aliases to engines that truly don't support aliases. Even though the new mutation logic makes the aliases less important, they still serve two purposes:

  1. Rendering a more descriptive query in the View Query tab that can be copied and used elsewhere.
  2. Making the subqueries in more complex queries work. Several engines require forced quotes on aliases to resolve in the main query, and in some cases a column from a subquery can exceed the max column name length, in which case it needs to be mutated.

It is my understanding that Postgres requires quotes around case-sensitive alises. Eg. select 1 as x_Col will return the header x_col, while select 1 as "x_Col" will return x_Col. If I remember correctly psycopg2 does this quoting automatically, but if not, it might make sense to set force_column_alias_quotes to True. I can test this more thoroughly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this on Postgres and wasn't able to reproduce case sensitive cols not being automatically quoted. See below how lowercol(=lowercase) is not quoted but mixedCol(=mixed case) is automatically quoted in the main query.
screenshot 2019-02-11 at 17 40 51

@agrawaldevesh
Copy link
Contributor

LGTM. My main question is wrt to should we always set labels_expected and stamp that back in ? I am curious if that would break something ? Perhaps something in the explore flow might break ?

Thanks for cleaning this !

@agrawaldevesh
Copy link
Contributor

I am wondering if there are any semantic changes wrt to how the queries are created for each DB ?

Longer term, this stuff is still fragile and not well unit tested. We should consider adding unit tests for this that actually assert the query shapes (perhaps by parsing out the generated queries using the sqlparser/moz-parser). The current test infra, surprisingly does not really test this: It merely runs the tests for three dummy test environments: No customer is probably going to use postgres, mysql or sqlite to do actual data warehouse visualizations.

@villebro
Copy link
Member Author

@agrawaldevesh agreed, it would be neat to have an extras_require section for every officially supported engine in setup.py, with some minimal semantic validation for each. Also, if commercial DWs like BigQuery, Snowflake, Athena, Redshift would like to offer free testing servers, more comprehensive integration tests could be added for those.

@agrawaldevesh
Copy link
Contributor

Looks great now. I don't have any more comments :-). I will test this patch on my Pinot installation today too.

@agrawaldevesh
Copy link
Contributor

I can confirm that the patch works on my Pinot installation. So I think it's good to go.

@villebro
Copy link
Member Author

Thanks, I'll give this one more spin on my test rig before submitting for committer review.

@kristw kristw added change:backend Requires changing the backend #code-quality labels Feb 10, 2019
@villebro villebro changed the title [WIP] Refactor dataframe and column name mutation logic Refactor dataframe and column name mutation logic Feb 11, 2019
@villebro
Copy link
Member Author

@betodealmeida @mistercrunch This is now ready for review.

@villebro villebro changed the title Refactor dataframe and column name mutation logic [WIP] Refactor dataframe and column name mutation logic Feb 12, 2019
@villebro
Copy link
Member Author

FYI, I stumbled on some unrelated errors when testing this, will work through those first before removing WIP.

@villebro villebro changed the title [WIP] Refactor dataframe and column name mutation logic Refactor dataframe and column name mutation logic Feb 15, 2019
@villebro
Copy link
Member Author

@agrawaldevesh @betodealmeida @mistercrunch This is (again) ready for review, sorry for prematurely flagging this as done earlier. No big changes since @agrawaldevesh 's initial review, just some small refinements.

@mistercrunch
Copy link
Member

LGTM

@agrawaldevesh
Copy link
Contributor

LGTM to me too :-)

@agrawaldevesh
Copy link
Contributor

@mistercrunch @betodealmeida , I am wondering when would this PR be merged in. As soon as this one goes in, I want to #6831 on top and resubmit that.

@mistercrunch mistercrunch merged commit f5277fe into apache:master Feb 21, 2019
@villebro villebro deleted the refactor_label_mutator branch February 24, 2019 19:16
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.34.0 labels Feb 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels change:backend Requires changing the backend 🚢 0.34.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants