Refactor dataframe and column name mutation logic #6847
Codecov Report
@@ Coverage Diff @@
## master #6847 +/- ##
==========================================
+ Coverage 63.83% 63.93% +0.09%
==========================================
Files 421 421
Lines 20447 20458 +11
Branches 2218 2218
==========================================
+ Hits 13053 13079 +26
+ Misses 7262 7247 -15
Partials 132 132
@@ -111,8 +111,10 @@ class BaseEngineSpec(object):
    time_secondary_columns = False
    inner_joins = True
    allows_subquery = True
    supports_column_aliases = True
It would be good to add a comment on what this means, if only for our own understanding: it means not only that the DB supports the "AS" keyword, but also that it actually returns the columns under the specified alias names, i.e. aliases are supported both syntactically and semantically.
For example, Pinot supports the "AS" keyword syntactically, but the column is not actually returned under the alias.
Interestingly, as @betodealmeida points out, Postgres also violates this at times: it sometimes returns the column name as 'sum(foo)' instead of 'SUM(foo)'.
So should we say that supports_column_aliases should be false even for PG?
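The "semantic" half of this check can be sketched as a small probe: run a query with an alias and compare the name the driver reports with the one requested. This is only an illustration, not part of the patch; `returns_requested_alias` is a hypothetical helper, and SQLite stands in for the engine under test because it ships with Python.

```python
import sqlite3


def returns_requested_alias(conn, alias):
    """Return True if the driver reports exactly `alias` as the column name.

    Hypothetical helper for illustration; not part of Superset.
    """
    # Quote the alias so its case is preserved; double any embedded quotes.
    quoted = '"' + alias.replace('"', '""') + '"'
    cursor = conn.execute(f"SELECT 1 AS {quoted}")
    # DB-API: each entry in cursor.description starts with the column name.
    return cursor.description[0][0] == alias


conn = sqlite3.connect(":memory:")
print(returns_requested_alias(conn, "x_Col"))
```

An engine that accepts "AS" but reports a different name would make the probe return False, which is the case the flag is meant to capture.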
I think it's good to only disable `supports_column_aliases` on engines that truly don't support aliases. Even though the new mutation logic makes the aliases less important, they still serve two purposes:
- Rendering a more descriptive query in the `View Query` tab that can be copied and used elsewhere.
- Making the subqueries in more complex queries work. Several engines require forced quotes on aliases for them to resolve in the main query, and in some cases a column from a subquery can exceed the max column name length, in which case it needs to be mutated.
It is my understanding that Postgres requires quotes around case-sensitive aliases. E.g. `select 1 as x_Col` will return the header `x_col`, while `select 1 as "x_Col"` will return `x_Col`. If I remember correctly, `psycopg2` does this quoting automatically, but if not, it might make sense to set `force_column_alias_quotes` to True. I can test this more thoroughly.
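The case-folding behaviour described here can be modelled in a few lines. This is a rough approximation of how Postgres treats identifiers (unquoted ones are folded to lower case, quoted ones are kept verbatim), not a query against a real server; `postgres_header` and `quote_alias` are hypothetical names used only for this sketch.

```python
def quote_alias(alias):
    """Wrap an alias in double quotes, doubling any embedded quotes."""
    return '"' + alias.replace('"', '""') + '"'


def postgres_header(alias_sql):
    """Approximate the result header for an alias as written in the SQL text."""
    is_quoted = (
        len(alias_sql) >= 2
        and alias_sql.startswith('"')
        and alias_sql.endswith('"')
    )
    if is_quoted:
        # Quoted identifiers keep their case; undo the quote doubling.
        return alias_sql[1:-1].replace('""', '"')
    # Unquoted identifiers are folded to lower case.
    return alias_sql.lower()


print(postgres_header("x_Col"))               # unquoted alias
print(postgres_header(quote_alias("x_Col")))  # quoted alias
```

Under this model, forcing quotes on every alias is what keeps the returned header identical to the label the query asked for.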
LGTM. My main question is whether we should always set labels_expected and stamp that back in. I am curious whether that would break something, perhaps in the explore flow. Thanks for cleaning this up!
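For context, "stamping back" can be pictured as overwriting the column names the engine returned with the labels the query asked for. The sketch below is an assumption about the general idea, not the code in this PR: `stamp_labels` is a hypothetical helper, and plain lists stand in for dataframe columns.

```python
def stamp_labels(returned_columns, labels_expected):
    """Return the column names to use for the result set.

    Hypothetical helper: if the engine returned as many columns as the query
    asked for, trust the expected labels; otherwise keep what came back.
    """
    if len(returned_columns) != len(labels_expected):
        # Shapes differ, so a positional rename would be unsafe.
        return list(returned_columns)
    return list(labels_expected)


print(stamp_labels(["sum(foo)", "playerid"], ["SUM(foo)", "playerID"]))
```

The length guard is one way to limit the risk raised in the question: a positional rename is only applied when the result shape matches what the query requested.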
I am wondering whether there are any semantic changes in how the queries are created for each DB. Longer term, this stuff is still fragile and not well unit tested. We should consider adding unit tests that actually assert the query shapes (perhaps by parsing the generated queries with sqlparser/moz-parser). The current test infra, surprisingly, does not really test this: it merely runs the tests against three dummy test environments. Probably no customer is going to use postgres, mysql or sqlite for actual data warehouse visualizations.
@agrawaldevesh agreed, it would be neat to have an
Looks great now. I don't have any more comments :-). I will test this patch on my Pinot installation today too.
I can confirm that the patch works on my Pinot installation. So I think it's good to go.
Thanks, I'll give this one more spin on my test rig before submitting for committer review.
@betodealmeida @mistercrunch This is now ready for review.
…g column aliases and add column name length checking
FYI, I stumbled on some unrelated errors when testing this, will work through those first before removing WIP.
@agrawaldevesh @betodealmeida @mistercrunch This is (again) ready for review, sorry for prematurely flagging this as done earlier. No big changes since @agrawaldevesh's initial review, just some small refinements.
LGTM
LGTM too :-)
@mistercrunch @betodealmeida, I am wondering when this PR will be merged. As soon as this one goes in, I want to rebase #6831 on top and resubmit that.
Several small refactors, features and fixes related to SQLAlchemy engines:
- […] `db_engine_specs`. By default engines support aliases; this is only disabled on Pinot for now. If the engine doesn't support aliases, labels aren't added to columns (see screenshot below). This is done by replacing the `self.get_label()` function with `self.make_sqla_column_compatible()`, which returns a label if labels are supported and the original column if not.
- […] `query_str_ext.labels_expected`.
- […] `connectors/sqla/models.py` from `db_engine_specs.py` and merge it with the column renaming logic from before.
- […] `max_column_name_length` property to `db_engine_specs`, and default to an MD5 hash if the maximum column name length is exceeded. Add max lengths to engines that were easy to find by googling.
- […] `db_engine_specs`.

Example
When creating a chart on Pinot where `COUNT(*)` is calculated grouping by `playerID`, the result shows up like before:
Looking at the query, the alias is gone, making it regular PQL:
FYI @agrawaldevesh please check this and give comments when you have the time. I tested this with your fork of `pinot_dbapi` (I assume that's the one you are using for now).
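The alias and length handling summarized in the description above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in this PR: `EngineSpecSketch`, `safe_label` and `select_item` are hypothetical names, plain SQL strings stand in for SQLAlchemy columns, and only the two attribute names `supports_column_aliases` and `max_column_name_length` come from the description.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class EngineSpecSketch:
    """Hypothetical stand-in for an engine spec; not Superset's class."""

    supports_column_aliases: bool = True
    max_column_name_length: Optional[int] = None

    def safe_label(self, label: str) -> str:
        """Fall back to an MD5 hex digest when the label is too long."""
        limit = self.max_column_name_length
        if limit is not None and len(label) > limit:
            digest = hashlib.md5(label.encode("utf-8")).hexdigest()
            # The digest is 32 characters; trim it if the limit is shorter.
            return digest[:limit]
        return label

    def select_item(self, expression: str, label: str) -> str:
        """Return the labelled expression, or the bare one without aliases."""
        if not self.supports_column_aliases:
            return expression
        return f'{expression} AS "{self.safe_label(label)}"'


pinot_like = EngineSpecSketch(supports_column_aliases=False)
default = EngineSpecSketch(max_column_name_length=63)
print(pinot_like.select_item("COUNT(*)", "count"))
print(default.select_item("COUNT(*)", "count"))
```

With aliases disabled the expression is emitted unchanged, which matches the alias-free query in the Pinot example; with aliases enabled, an over-long label is replaced by a fixed-length hash so it stays within the engine's limit.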