[fix] Isolate and improve performance on tagging system #7858

betodealmeida · 2019-07-12T06:06:22Z

SUMMARY

This PR improves the performance of the tagging system in 3 ways:

Mutations only happen if the feature flag is enabled;
The tagged_object table now has an index to improve queries;
The migration script setting the initial tags is much faster.
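
As a rough illustration of the first point, the sketch below hooks tag maintenance into saves only when a flag is on. Everything here (`FEATURE_FLAGS`, the `TAGGING_SYSTEM` key, `LISTENERS`, `register_tag_listeners`, `save`) is a hypothetical stand-in written for this example, not Superset's actual code.

```python
from typing import Callable, Dict, List

# Hypothetical flag store and listener registry, for illustration only.
FEATURE_FLAGS: Dict[str, bool] = {"TAGGING_SYSTEM": False}
LISTENERS: Dict[str, List[Callable[[dict], None]]] = {}


def is_feature_enabled(name: str) -> bool:
    """Return the flag value, treating unknown flags as disabled."""
    return FEATURE_FLAGS.get(name, False)


def update_tags(obj: dict) -> None:
    """Stand-in for the real tag mutation."""
    obj.setdefault("tags", []).append("type:" + obj["kind"])


def register_tag_listeners() -> None:
    # Only hook the mutation in when the flag is on; otherwise saving
    # an object never touches the tag data at all.
    if is_feature_enabled("TAGGING_SYSTEM"):
        LISTENERS.setdefault("after_insert", []).append(update_tags)


def save(obj: dict) -> dict:
    """Pretend to persist an object, then fire any registered listeners."""
    for listener in LISTENERS.get("after_insert", []):
        listener(obj)
    return obj
```

With the flag off, `save` returns the object untouched; once the flag is on and listeners are registered, each save also records a tag.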

TEST PLAN

I deleted all tags and tag associations and ran the script. Tags were recreated correctly. This was also tested in staging with real data back in April.

I also verified that the query reported by @graceguo-supercat (#7807) is using the index.
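
One way to run this kind of check is to ask the database for its query plan. The sketch below uses an in-memory SQLite database as a stand-in; the column layout of `tagged_object`, the indexed column, and the index name are assumptions made for the example, since the description does not spell them out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tagged_object (
        id INTEGER PRIMARY KEY,
        tag_id INTEGER,
        object_id INTEGER,
        object_type TEXT
    );
    -- Assumed index column and name, for illustration only.
    CREATE INDEX ix_tagged_object_object_id ON tagged_object (object_id);
    """
)


def plan_for(sql: str, params: tuple = ()) -> str:
    """Return the query plan for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row is the human-readable plan step.
    return "\n".join(row[-1] for row in rows)


plan = plan_for("SELECT tag_id FROM tagged_object WHERE object_id = ?", (42,))
uses_index = "ix_tagged_object_object_id" in plan
```

Other databases expose the same information through their own `EXPLAIN` variants, with different output formats.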

ADDITIONAL INFORMATION

REVIEWERS

@graceguo-supercat @michellethomas @mistercrunch

mistercrunch · 2019-07-12T15:51:51Z

scripts/create_initial_tags.py

+        'query' AS object_type
+      FROM saved_query
+      JOIN tag
+      ON tag.name = 'type:query';


Can this be run multiple times and reach the same final state? If not, it would be great to modify it so that it can. What if the feature flag goes on/off/on/off over time and we want to create the bulk of missing tags? It seems like adding a simple LEFT JOIN (...) WHERE {remote table column} IS NULL would enable that.

If/when that's the case, we could add it as a CLI command, maybe sync-tags or something.

Good point, currently this will fail if run multiple times. I'll fix it.
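
The LEFT JOIN ... IS NULL pattern suggested above can be sketched as follows. This is a minimal illustration against an in-memory SQLite database, not the code that ended up in the PR; the `tagged_object` columns and the helper name `sync_query_tags` are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE saved_query (id INTEGER PRIMARY KEY);
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tagged_object (
        id INTEGER PRIMARY KEY,
        tag_id INTEGER,
        object_id INTEGER,
        object_type TEXT
    );
    INSERT INTO saved_query (id) VALUES (1), (2);
    INSERT INTO tag (id, name) VALUES (1, 'type:query');
    """
)

# Insert only the (tag, object) pairs that have no tagged_object row yet.
SYNC_QUERY_TAGS = """
    INSERT INTO tagged_object (tag_id, object_id, object_type)
    SELECT tag.id, saved_query.id, 'query'
    FROM saved_query
    JOIN tag ON tag.name = 'type:query'
    LEFT JOIN tagged_object
      ON tagged_object.tag_id = tag.id
     AND tagged_object.object_id = saved_query.id
     AND tagged_object.object_type = 'query'
    WHERE tagged_object.id IS NULL
"""


def sync_query_tags() -> int:
    """Create missing tag associations and return the total row count."""
    with conn:  # commits on success, rolls back on error
        conn.execute(SYNC_QUERY_TAGS)
    return conn.execute("SELECT COUNT(*) FROM tagged_object").fetchone()[0]
```

Calling `sync_query_tags()` repeatedly leaves the table unchanged after the first run, and a saved query added later is picked up by the next run.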

mistercrunch · 2019-07-12T15:54:12Z

scripts/create_initial_tags.py

+        literal(ObjectTypes.query.name).label('object_type'),
+    ]).select_from(join(saved_query, tag, tag.c.name == 'type:query'))
+
+    combined = union_all(charts, dashboards, saved_queries).alias().select()


Wouldn't 3 transactions be better than one big one? Not a big difference, but it would break the work into 3 smaller chunks. If one section takes a long time or fails, we'd have more clarity about where the problem is coming from.

You're right, especially if the script can be run multiple times. Originally I ran this with Superset stopped, so it wasn't a problem. I'll fix it.
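
The idea of one transaction per object type can be sketched like this, again with SQLite standing in for the real database. The source table names (`slices`, `dashboards`), the `tagged_object` columns, and `run_steps` are assumptions for the example; the point is only that each step commits or fails on its own and reports under its own label.

```python
import sqlite3
from typing import Dict, List, Tuple

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE slices (id INTEGER PRIMARY KEY);
    CREATE TABLE dashboards (id INTEGER PRIMARY KEY);
    CREATE TABLE saved_query (id INTEGER PRIMARY KEY);
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tagged_object (
        id INTEGER PRIMARY KEY,
        tag_id INTEGER,
        object_id INTEGER,
        object_type TEXT
    );
    INSERT INTO slices (id) VALUES (1), (2);
    INSERT INTO dashboards (id) VALUES (1);
    INSERT INTO saved_query (id) VALUES (1);
    INSERT INTO tag (name) VALUES ('type:chart'), ('type:dashboard'), ('type:query');
    """
)

INSERT_TEMPLATE = (
    "INSERT INTO tagged_object (tag_id, object_id, object_type) "
    "SELECT tag.id, {table}.id, '{kind}' FROM {table} "
    "JOIN tag ON tag.name = 'type:{kind}'"
)

STEPS: List[Tuple[str, str]] = [
    ("charts", INSERT_TEMPLATE.format(table="slices", kind="chart")),
    ("dashboards", INSERT_TEMPLATE.format(table="dashboards", kind="dashboard")),
    ("saved_queries", INSERT_TEMPLATE.format(table="saved_query", kind="query")),
]


def run_steps(db: sqlite3.Connection, steps: List[Tuple[str, str]]) -> Dict[str, str]:
    """Run each step in its own transaction and record the outcome per label."""
    results: Dict[str, str] = {}
    for label, sql in steps:
        try:
            with db:  # one transaction per step
                db.execute(sql)
            results[label] = "ok"
        except sqlite3.Error as exc:
            # A failure is attributed to its step and does not undo the others.
            results[label] = f"failed: {exc}"
    return results
```

If one of the three statements fails, the returned mapping names the failing section while the other two remain committed.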

etr2460

lgtm, thanks for the bug fix!

Something that might be worth considering for future improvements is performing the tag table updates off the critical path. I don't see any reason to block returning to the client to perform the tagging updates after slices, dashboards, and favorites change. A few options:

Make a non-blocking SQLAlchemy call (https://stackoverflow.com/questions/10214042/can-sqlalchemy-be-configured-to-be-non-blocking), then return to the user
Queue up a Celery task to perform the tag update, then return to the user
Remove the event listeners entirely and just perform batch updates every 5 minutes or so in a Celery task
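
To make "off the critical path" concrete, here is a minimal sketch of the second option's shape. It uses a standard-library thread pool as a stand-in for a Celery queue, which is a substitution made for this example; `save_chart`, `update_tags`, and the in-memory `TAGS` store are hypothetical.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, List

# A single worker thread stands in for a task queue such as Celery
# (assumption: a real deployment would enqueue a task instead).
_executor = ThreadPoolExecutor(max_workers=1)
TAGS: Dict[int, List[str]] = {}


def update_tags(chart_id: int) -> None:
    """Stand-in for the tag table update."""
    TAGS.setdefault(chart_id, []).append("type:chart")


def save_chart(chart_id: int) -> Future:
    # ... persisting the chart itself would happen here ...
    # Schedule tag maintenance and return without waiting for it.
    return _executor.submit(update_tags, chart_id)
```

The caller gets control back immediately; the returned `Future` is only needed if something later wants to confirm the tag update finished.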

mistercrunch · 2019-07-12T17:00:54Z

@etr2460 I'm thinking tag update/maintenance should just be super cheap OLTP-type database operations executed in milliseconds; we were just missing the index. I think it's OK for them to be inline.

betodealmeida · 2019-07-31T01:48:34Z

@mistercrunch, can you do another pass on this PR?

mistercrunch

Minor comment, otherwise LGTM

mistercrunch · 2019-07-31T04:39:54Z

scripts/create_initial_tags.py

+from superset.common.tags import add_favorites, add_owners, add_types
+
+
+def main():


This script seems redundant with the CLI command superset sync_tags. Any reason to have both?

Yeah, good point. I'll remove it.

codecov-io · 2019-07-31T16:19:38Z

Codecov Report

Merging #7858 into master will decrease coverage by 0.48%.
The diff coverage is 13.41%.

@@            Coverage Diff             @@
##           master    #7858      +/-   ##
==========================================
- Coverage   65.96%   65.47%   -0.49%     
==========================================
  Files         468      469       +1     
  Lines       22308    22381      +73     
  Branches     2432     2432              
==========================================
- Hits        14715    14654      -61     
- Misses       7472     7606     +134     
  Partials      121      121

Impacted Files                       Coverage Δ
superset/models/core.py              81.12% <20%> (-1.9%)      ⬇️
superset/cli.py                      37.33% <42.85%> (+0.17%)  ⬆️
superset/common/tags.py              9.23% <9.23%> (ø)
superset/db_engine_specs/mysql.py    34.88% <0%> (-58.14%)     ⬇️
superset/models/tags.py              59.4% <0%> (-30.7%)       ⬇️
superset/views/core.py               71% <0%> (-0.22%)         ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9b7261f...ed8bc9f.

pull-request-size bot added the size/L label Jul 12, 2019

betodealmeida added the !deprecated-label:bug, .database, and risk:db-migration labels Jul 12, 2019

mistercrunch reviewed Jul 12, 2019

etr2460 approved these changes Jul 12, 2019

betodealmeida added 4 commits July 30, 2019 11:45

Fix tag perf

269cd5d

Add ASF header

4d77fd2

Make script idempotent

a499ae2

Add CLI to sync tags

850bad5

betodealmeida force-pushed the improve_perf_tags branch from 948644d to 850bad5 Compare July 30, 2019 22:10

betodealmeida added 3 commits July 30, 2019 15:11

Add missing file

2eb4528

Merge heads

32b020d

Fix lint

506540e

mistercrunch approved these changes Jul 31, 2019

Remove script

ed8bc9f

betodealmeida merged commit 10f00cd into apache:master Jul 31, 2019

Carolinewly mentioned this pull request Jun 12, 2023

[Snyk] Upgrade deck.gl from 8.5.2 to 8.9.15 Carolinewly/superset#1

Open

mistercrunch added the 🏷️ bot and 🚢 0.34.0 labels Feb 28, 2024
