Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix] Isolate and improve performance on tagging system #7858

Merged
merged 8 commits into from Jul 31, 2019

Conversation

betodealmeida
Copy link
Member

CATEGORY

Choose one

  • Bug Fix
  • Enhancement (new features, refinement)
  • Refactor
  • Add tests
  • Build / Development Environment
  • Documentation

SUMMARY

This PR improves the performance experienced with the tagging system, in 3 ways:

  1. Mutations only happen if the feature flag is enabled;
  2. The tagged_object table now has an index to improve queries;
  3. The migration script setting the initial tags is much faster.

TEST PLAN

I deleted all tags and tag associations and ran the script. Tags were recreated correctly. This was also tested in staging with real data back in April.

I also verified that the query reported by @graceguo-supercat (#7807) is using the index.

ADDITIONAL INFORMATION

  • Has associated issue:
  • Changes UI
  • Requires DB Migration.
  • Confirm DB Migration upgrade and downgrade tested.
  • Introduces new feature or API
  • Removes existing feature or API

REVIEWERS

@graceguo-supercat @michellethomas @mistercrunch

@betodealmeida betodealmeida added !deprecated-label:bug Deprecated label - Use #bug instead .database risk:db-migration PRs that require a DB migration labels Jul 12, 2019
'query' AS object_type
FROM saved_query
JOIN tag
ON tag.name = 'type:query';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be run multiple times and reach the same final state? If that's not the case it would be great to modify so that it is the case. What if the feature flag goes on/off/on/off over time and we want to create the bulk of missing tags? It seems like adding a simple LEFT JOIN (...) WHERE {remote table column} IS NULL would enable that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If/when that's the case, we could add it as a cli command sync-tags or something maybe.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, currently this will fail if run multiple times. I'll fix it.

literal(ObjectTypes.query.name).label('object_type'),
]).select_from(join(saved_query, tag, tag.c.name == 'type:query'))

combined = union_all(charts, dashboards, saved_queries).alias().select()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't 3 transactions instead of a bigger one be better? Not a big difference, but would allow to break down in 3 smaller chunks. If one section takes a long time or fails, we'd have more clarity as to where it's coming from.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, especially if the script can be run multiple times. Originally I ran this with Superset stopped, so it wasn't a problem. I'll fix it.

Copy link
Member

@etr2460 etr2460 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks for the bug fix!

Something that might be worth considering for future improvements is performing the tag table updates off the critical path. I don't see any reason to block returning to the client to perform the tagging updates after slices, dashboards, and favorites change. A few options:

@mistercrunch
Copy link
Member

@etr2460 I'm thinking tag update/maintenance should just be super cheap OLTP-type database operations executed in milliseconds, we just were missing the index... I think it's ok for them to be inline

@betodealmeida
Copy link
Member Author

@mistercrunch, can you do another pass in this PR?

Copy link
Member

@mistercrunch mistercrunch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment, otherwise LGTM

from superset.common.tags import add_favorites, add_owners, add_types


def main():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This script seems redundant with the cli superset sync_tags , any reasons to have both?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good point. I'll remove it.

@codecov-io
Copy link

Codecov Report

Merging #7858 into master will decrease coverage by 0.48%.
The diff coverage is 13.41%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7858      +/-   ##
==========================================
- Coverage   65.96%   65.47%   -0.49%     
==========================================
  Files         468      469       +1     
  Lines       22308    22381      +73     
  Branches     2432     2432              
==========================================
- Hits        14715    14654      -61     
- Misses       7472     7606     +134     
  Partials      121      121
Impacted Files Coverage Δ
superset/models/core.py 81.12% <20%> (-1.9%) ⬇️
superset/cli.py 37.33% <42.85%> (+0.17%) ⬆️
superset/common/tags.py 9.23% <9.23%> (ø)
superset/db_engine_specs/mysql.py 34.88% <0%> (-58.14%) ⬇️
superset/models/tags.py 59.4% <0%> (-30.7%) ⬇️
superset/views/core.py 71% <0%> (-0.22%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9b7261f...ed8bc9f. Read the comment docs.

@betodealmeida betodealmeida merged commit 10f00cd into apache:master Jul 31, 2019
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.34.0 labels Feb 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels !deprecated-label:bug Deprecated label - Use #bug instead risk:db-migration PRs that require a DB migration size/L 🚢 0.34.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants