[fix] Isolate and improve performance on tagging system #7858
scripts/create_initial_tags.py
Outdated
'query' AS object_type
FROM saved_query
JOIN tag
ON tag.name = 'type:query';
Can this be run multiple times and reach the same final state? If not, it would be great to modify it so that it can. What if the feature flag goes on/off/on/off over time and we want to create the bulk of missing tags? It seems like adding a simple LEFT JOIN (...) WHERE {remote table column} IS NULL would enable that.
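A minimal sketch of that LEFT JOIN ... IS NULL idea, using the standard-library sqlite3 module rather than the script's actual SQLAlchemy code. The table layout here (tagged_object with tag_id, object_id and object_type columns) and the helper names sync_query_tags and count_tagged are assumptions made for illustration, not taken from the PR.

```python
import sqlite3

# Toy schema: column names are assumptions made for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE saved_query (id INTEGER PRIMARY KEY);
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE tagged_object (
        id INTEGER PRIMARY KEY,
        tag_id INTEGER,
        object_id INTEGER,
        object_type TEXT
    );
    INSERT INTO saved_query (id) VALUES (1), (2), (3);
    INSERT INTO tag (name) VALUES ('type:query');
""")

# Insert only the associations that do not exist yet: the LEFT JOIN
# finds no matching tagged_object row, so its columns come back NULL.
SYNC_QUERY_TAGS = """
    INSERT INTO tagged_object (tag_id, object_id, object_type)
    SELECT tag.id, saved_query.id, 'query'
    FROM saved_query
    JOIN tag ON tag.name = 'type:query'
    LEFT JOIN tagged_object
        ON tagged_object.tag_id = tag.id
        AND tagged_object.object_id = saved_query.id
        AND tagged_object.object_type = 'query'
    WHERE tagged_object.id IS NULL
"""


def sync_query_tags(conn):
    """Create missing 'type:query' associations; safe to run repeatedly."""
    with conn:  # commit on success, roll back on error
        conn.execute(SYNC_QUERY_TAGS)


def count_tagged(conn):
    """Return the number of rows currently in tagged_object."""
    return conn.execute("SELECT COUNT(*) FROM tagged_object").fetchone()[0]
```

Running sync_query_tags twice in a row should leave the row count unchanged, and a saved query added in between should pick up exactly one new association on the next run.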
If/when that's the case, we could add it as a CLI command, sync-tags or something like that.
Good point, currently this will fail if run multiple times. I'll fix it.
scripts/create_initial_tags.py
Outdated
literal(ObjectTypes.query.name).label('object_type'),
]).select_from(join(saved_query, tag, tag.c.name == 'type:query'))

combined = union_all(charts, dashboards, saved_queries).alias().select()
Wouldn't 3 transactions be better than one bigger one? Not a big difference, but it would break the work down into 3 smaller chunks. If one section takes a long time or fails, we'd have more clarity as to where the problem is coming from.
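As a rough illustration of the three-transaction suggestion, here is a sketch using the standard-library sqlite3 module instead of the script's SQLAlchemy code. The helper run_in_separate_transactions, the toy tables, and the deliberately missing saved_query table are all assumptions made for the example.

```python
import sqlite3


def run_in_separate_transactions(conn, steps):
    """Run each (name, sql) step in its own transaction.

    Returns a dict mapping the step name to "ok" or a failure message,
    so a slow or failing section can be identified by name.
    """
    results = {}
    for name, sql in steps:
        try:
            with conn:  # commits on success, rolls back on exception
                conn.execute(sql)
            results[name] = "ok"
        except sqlite3.Error as exc:
            results[name] = f"failed: {exc}"
    return results


# Toy data: table and column names are assumptions made for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE slices (id INTEGER PRIMARY KEY);
    CREATE TABLE dashboards (id INTEGER PRIMARY KEY);
    CREATE TABLE tagged_object (object_id INTEGER, object_type TEXT);
    INSERT INTO slices (id) VALUES (1), (2);
    INSERT INTO dashboards (id) VALUES (1);
""")

steps = [
    ("charts",
     "INSERT INTO tagged_object SELECT id, 'chart' FROM slices"),
    ("dashboards",
     "INSERT INTO tagged_object SELECT id, 'dashboard' FROM dashboards"),
    # The saved_query table is intentionally absent to simulate a failure.
    ("saved_queries",
     "INSERT INTO tagged_object SELECT id, 'query' FROM saved_query"),
]

results = run_in_separate_transactions(conn, steps)
```

In this toy run the first two steps commit their rows even though the third fails, which is the kind of isolation a single large transaction would not give.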
You're right, especially if the script can be run multiple times. Originally I ran this with Superset stopped, so it wasn't a problem. I'll fix it.
lgtm, thanks for the bug fix!
Something that might be worth considering for future improvements is performing the tag table updates off the critical path. I don't see any reason to block returning to the client to perform the tagging updates after slices, dashboards, and favorites change. A few options:
- Make a non-blocking SQLAlchemy call (https://stackoverflow.com/questions/10214042/can-sqlalchemy-be-configured-to-be-non-blocking), then return to the user
- Queue up a Celery task to perform the tag update, then return to the user
- Remove the event listeners entirely and just perform batch updates every 5 minutes or so in a celery task
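The "queue the update, then return" option could look roughly like the sketch below. It uses a standard-library queue and worker thread as a stand-in for Celery, and an in-memory set as a stand-in for the tag tables; save_dashboard, tag_worker and the other names are hypothetical and not part of Superset.

```python
import queue
import threading

tag_updates = queue.Queue()   # stand-in for a Celery task queue
tagged_objects = set()        # stand-in for the tagged_object table


def tag_worker():
    """Apply queued tag updates in the background until told to stop."""
    while True:
        item = tag_updates.get()
        try:
            if item is None:  # sentinel: shut the worker down
                break
            tagged_objects.add(item)
        finally:
            tag_updates.task_done()


def save_dashboard(dashboard_id):
    """Hypothetical request handler: enqueue the tag update and return."""
    # ... the dashboard itself would be persisted here ...
    tag_updates.put(("type:dashboard", dashboard_id))
    return {"id": dashboard_id, "status": "saved"}


worker = threading.Thread(target=tag_worker, daemon=True)
worker.start()

response = save_dashboard(42)  # returns without waiting for the tag write
tag_updates.join()             # demo only: wait so the result can be inspected
```

The trade-off is that tags become eventually consistent: a client may briefly see the saved object without its tags.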
@etr2460 I'm thinking tag update/maintenance should just be super cheap OLTP-type database operations executed in milliseconds, we were just missing the index... I think it's ok for them to be inline
Force-pushed from 948644d to 850bad5
@mistercrunch, can you do another pass on this PR?
Minor comment, otherwise LGTM
scripts/create_initial_tags.py
Outdated
from superset.common.tags import add_favorites, add_owners, add_types


def main():
This script seems redundant with the CLI command superset sync_tags. Any reason to have both?
Yeah, good point. I'll remove it.
Codecov Report
@@            Coverage Diff             @@
##           master    #7858      +/-   ##
==========================================
- Coverage   65.96%   65.47%    -0.49%
==========================================
  Files         468      469       +1
  Lines       22308    22381      +73
  Branches     2432     2432
==========================================
- Hits        14715    14654      -61
- Misses       7472     7606     +134
  Partials      121      121
CATEGORY
Choose one
SUMMARY
This PR improves the performance of the tagging system in 3 ways:
- The tagged_object table now has an index to improve queries;
TEST PLAN
I deleted all tags and tag associations and ran the script. Tags were recreated correctly. This was also tested in staging with real data back in April.
I also verified that the query reported by @graceguo-supercat (#7807) is using the index.
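One way to run that kind of check locally is to ask the database for its query plan. The sketch below uses SQLite's EXPLAIN QUERY PLAN through the standard-library sqlite3 module; other databases use a different EXPLAIN syntax. The index name, the indexed column and the sample query are assumptions made for illustration, not the ones from the PR or from #7807.

```python
import sqlite3

# Toy schema: the index name and indexed column are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tagged_object (
        id INTEGER PRIMARY KEY,
        tag_id INTEGER,
        object_id INTEGER,
        object_type TEXT
    );
    CREATE INDEX ix_tagged_object_object_id ON tagged_object (object_id);
""")


def query_plan(conn, sql, params=()):
    """Return the human-readable plan steps SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # the detail text is the last column


plan = query_plan(
    conn,
    "SELECT tag_id FROM tagged_object "
    "WHERE object_id = ? AND object_type = ?",
    (1, "chart"),
)

# True when any plan step mentions the index by name.
uses_index = any("ix_tagged_object_object_id" in step for step in plan)
```

A plan step that reports a full table scan instead of naming the index would suggest the query is not benefiting from it.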
ADDITIONAL INFORMATION
REVIEWERS
@graceguo-supercat @michellethomas @mistercrunch