Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(explore): Prevent unnecessary series limit subquery #21154

Merged
merged 4 commits into from
Aug 26, 2022

Conversation

codyml
Copy link
Member

@codyml codyml commented Aug 22, 2022

SUMMARY

This PR attempts to fix a bug with the series limit control when GENERIC_CHART_AXES feature is turned on. Steps to reproduce bug:

  • Create a "Generic Chart"-type chart with the X-axis and a metric set. Don't add any dimensions. Note the number of X-values present on the chart.
  • Set the Series Limit to something lower than the number of X-values present on the chart and refresh. Note how X-values are excluded.

Upon investigation, it looks like the SQL query builder is adding a perhaps unnecessary join against a subquery whenever Series Limit is set, regardless of if there is actually a dimension set that results in series needing limiting. This PR attempts to make the join against the subquery only included when the frontend includes values in the series_columns array in the query context.

Update: Thanks to @villebro's input, this PR now handles legacy time-series charts as before by checking the is_timeseries param.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

Before:
localhost_9000_superset_welcome_

After:
localhost_9000_superset_welcome_ (1)

TESTING INSTRUCTIONS

  • Confirm baseline behavior with GENERIC_CHART_AXES disabled:
    • Create an ECharts time-series chart (e.g. ECharts > Time-series Line Chart) with 1+ metrics but no dimensions.
      • The chart X-axis should contain all time values, accounting for time grain. The "View query" modal should show no join present.
      • Setting Series Limit to a value less than the number of time values present on the chart X-axis should have no effect, and the query should not contain a join.
      • Adding a dimension then setting Series Limit to a value less than the number of dimension values should limit the number of rendered series to the series limit, and the query should now contain a join.
    • The above should also hold true for a legacy, non-ECharts time-series chart (e.g. nvd3 > Line Chart).
  • Enable GENERIC_CHART_AXES and test the same:
    • Create an ECharts time-series chart (e.g. ECharts > Line Chart) with 1+ metrics but no dimensions. Set the X-axis to the time column.
      • The chart X-axis should contain all time values, accounting for time grain. The "View query" modal should show no join present.
      • Setting Series Limit to a value less than the number of time values present on the chart X-axis should have no effect, and the query should not contain a join.
      • Adding a dimension then setting Series Limit to a value less than the number of dimension values should limit the number of rendered series to the series limit, and the query should now contain a join.
    • The above should also hold true for a legacy, non-ECharts time-series chart (e.g. nvd3 > Line Chart).

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags: GENERIC_CHART_AXES
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@codecov
Copy link

codecov bot commented Aug 22, 2022

Codecov Report

Merging #21154 (e4425eb) into master (d568999) will decrease coverage by 0.02%.
The diff coverage is 50.00%.

@@            Coverage Diff             @@
##           master   #21154      +/-   ##
==========================================
- Coverage   66.39%   66.37%   -0.03%     
==========================================
  Files        1781     1781              
  Lines       67885    67885              
  Branches     7244     7244              
==========================================
- Hits        45073    45059      -14     
- Misses      20953    20967      +14     
  Partials     1859     1859              
Flag Coverage Δ
hive 53.17% <50.00%> (ø)
mysql ?
postgres 81.04% <50.00%> (ø)
presto 53.07% <50.00%> (ø)
python 81.47% <50.00%> (-0.05%) ⬇️
sqlite 79.64% <50.00%> (ø)
unit 50.73% <50.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
superset/models/helpers.py 39.31% <0.00%> (ø)
superset/connectors/sqla/models.py 90.32% <100.00%> (ø)
superset/common/utils/dataframe_utils.py 90.47% <0.00%> (-4.77%) ⬇️
superset/db_engine_specs/mysql.py 94.04% <0.00%> (-4.77%) ⬇️
superset/models/core.py 88.03% <0.00%> (-0.72%) ⬇️
superset/views/core.py 75.15% <0.00%> (-0.46%) ⬇️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@jinghua-qa
Copy link
Member

/testenv up FEATURE_GENERIC_CHART_AXES=true

@github-actions
Copy link
Contributor

@jinghua-qa Ephemeral environment spinning up at http://54.200.17.110:8080. Credentials are admin/admin. Please allow several minutes for bootstrapping and startup.

@@ -1410,7 +1410,7 @@ def get_sqla_query( # pylint: disable=too-many-arguments,too-many-locals,too-ma
col=selected, template_processor=template_processor
)
groupby_all_columns[outer.name] = outer
if not series_column_names or outer.name in series_column_names:
if outer.name in series_column_names:
Copy link
Member

@hughhhh hughhhh Aug 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update this line for the helpers.py as well?

https://github.com/preset-io/superset/blob/cf02934b21ea608557ca600c28efc680647ef130/superset/models/helpers.py#L1518

I'm currently refactoring this piece of code for SIP-69 models via a Mixin

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should work - legacy timeseries charts will have set is_timeseries to True, and new charts will no longer be using it:

Suggested change
if outer.name in series_column_names:
if (
is_timeseries and not series_column_names
) or outer.name in series_column_names:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@villebro That makes sense to me, thanks! I can do some double-checking with the legacy timeseries charts – which ones are you talking about?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The legacy time series line, bar etc charts. Essentially the ones with is_timeseries set to True and that are served by viz.py

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok great, I'll update the PR ASAP!

@villebro
Copy link
Member

@codyml thanks for working on this - this is indeed an important fix. One of the reasons for that check is because legacy timeseries charts don't populate series_columns, and just assume columns or groupby is equal to series_columns (see #16660 for more context on this change). At the time we chose to ensure that the current functionality is unchanged, thereby adding that check. So I think removing it may cause a regression in some old charts. Let me see if I can propose an alternative solution that fixes the issue with GENERIC_CHART_AXES without introducing regression risk into old charts (give me a day or two).

@codyml
Copy link
Member Author

codyml commented Aug 23, 2022

@codyml thanks for working on this - this is indeed an important fix. One of the reasons for that check is because legacy timeseries charts don't populate series_columns, and just assume columns or groupby is equal to series_columns (see #16660 for more context on this change). At the time we chose to ensure that the current functionality is unchanged, thereby adding that check. So I think removing it may cause a regression in some old charts. Let me see if I can propose an alternative solution that fixes the issue with GENERIC_CHART_AXES without introducing regression risk into old charts (give me a day or two).

@villebro Thanks for the info, that's the kind of thing I was worried about. One idea I had was to extend the query object to include the x_axis field that the frontend sends in form_data, and use that to conditionally exclude the x_axis column in that part of the code, which hopefully would leave legacy charts that don't support generic x-axis unaffected. Do you think that would work?

@villebro
Copy link
Member

villebro commented Aug 24, 2022

Thanks for the info, that's the kind of thing I was worried about. One idea I had was to extend the query object to include the x_axis field that the frontend sends in form_data, and use that to conditionally exclude the x_axis column in that part of the code, which hopefully would leave legacy charts that don't support generic x-axis unaffected. Do you think that would work?

IMO including x_axis in the query object would be what we in Swedish call "tårta på tårta" ("cake on cake" in English referring to unnecessary redundancy), as x_axis should always be the complement of series_columns (x_axis + series_columns = columns). Incidentally, if we do want to reference x_axis in that part of the codebase, I would recommend calling it base_axes and making it a list/tuple of columns, as we may want to start introducing 3D charts or similar at some point, in which case there would be two base axes.

But back to the problem at hand - I have a hypothesis that I will validate now that should be uninvasive. Let me try it out and post my findings so we can get this fix merged asap.

Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know what you think of the proposed solution

@@ -1410,7 +1410,7 @@ def get_sqla_query( # pylint: disable=too-many-arguments,too-many-locals,too-ma
col=selected, template_processor=template_processor
)
groupby_all_columns[outer.name] = outer
if not series_column_names or outer.name in series_column_names:
if outer.name in series_column_names:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should work - legacy timeseries charts will have set is_timeseries to True, and new charts will no longer be using it:

Suggested change
if outer.name in series_column_names:
if (
is_timeseries and not series_column_names
) or outer.name in series_column_names:

@codyml codyml changed the title [WIP] fix(explore): Prevent unnecessary series limit subquery fix(explore): Prevent unnecessary series limit subquery Aug 24, 2022
Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Adding an integration test would be great, but as this appears to be a critical hotfix we can probably wait a few days before adding one (I can help if needed).

@github-actions
Copy link
Contributor

Storybook has completed and can be viewed at

Copy link
Member

@jinghua-qa jinghua-qa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK with temporarily skip the test and would like to see it fixed asap.

@rusackas rusackas merged commit 0726840 into apache:master Aug 26, 2022
@github-actions
Copy link
Contributor

Ephemeral environment shutdown and build artifacts deleted.

@mistercrunch
Copy link
Member

Nice! thank you for fixing this!

eschutho pushed a commit that referenced this pull request Sep 20, 2022
* Prevent series limit when no series limit columns specified.

* Add timeseries check for legacy charts.

* Apply fix to helpers.py.

* Skip Cypress color consistency tests.
AAfghahi pushed a commit that referenced this pull request Oct 5, 2022
* Prevent series limit when no series limit columns specified.

* Add timeseries check for legacy charts.

* Apply fix to helpers.py.

* Skip Cypress color consistency tests.
AAfghahi pushed a commit that referenced this pull request Oct 6, 2022
* Prevent series limit when no series limit columns specified.

* Add timeseries check for legacy charts.

* Apply fix to helpers.py.

* Skip Cypress color consistency tests.
Fahrenheit35 pushed a commit to Fahrenheit35/superset that referenced this pull request Nov 11, 2022
* Prevent series limit when no series limit columns specified.

* Add timeseries check for legacy charts.

* Apply fix to helpers.py.

* Skip Cypress color consistency tests.
@mistercrunch mistercrunch added 🍒 2.0.1 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 2.1.0 and removed 🚢 2.1.3 labels Mar 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/S v2.0 v2.0.1 🍒 2.0.1 🚢 2.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants