
fix(migration): Ensure the paginated update is deterministic #21778

Merged

Conversation

@john-bodley john-bodley (Member) commented Oct 12, 2022

SUMMARY

This PR fixes an issue with the paginated_update method, which is used in a number of migrations. The problem is that the pagination was not deterministic: each iteration slices the query via a SQL statement using OFFSET and LIMIT, but if the results are not ordered in a consistent way (e.g., by primary key), the ordering of the sliced results is effectively random, meaning that a record may never be processed, or may be processed multiple times.

Here is where the paginated_update method is called: it is used by the assign_uuids method, when updating the key_value table, and when migrating visualizations. The latter is likely the only case of concern, given that those migrations change the visualization type, which is part of the query filter, and thus the pagination continually skips over eligible records. Thankfully these migrations are only optional, as the legacy visualization types are not yet deprecated. A future migration will be necessary regardless, once the legacy visualization types are actually deprecated.
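To make the failure mode concrete, here is a minimal, self-contained simulation (plain Python, not the actual Superset code; the viz_type values are made up) of OFFSET/LIMIT pagination over a filter that the update itself invalidates:

    # A toy table of four records that all match the migration's filter.
    rows = [{"id": i, "viz_type": "legacy"} for i in range(1, 5)]

    def query_legacy():
        # Re-evaluates the filter on each call, like re-running the SQL query.
        return [r for r in rows if r["viz_type"] == "legacy"]

    batch_size, offset, processed = 2, 0, []
    while True:
        batch = query_legacy()[offset:offset + batch_size]  # OFFSET/LIMIT slice
        if not batch:
            break
        for row in batch:
            row["viz_type"] = "new"  # the migration mutates the filter column
            processed.append(row["id"])
        offset += batch_size

    print(processed)  # [1, 2] -- records 3 and 4 are silently skipped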

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

Tested locally.

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

Comment on the paginated_update hunk in superset/migrations/shared/utils.py:

        yield obj
        session.merge(obj)

    while True:
john-bodley (Member, author):

The options were to use either query.slice(...) or session.execute(query).fetchmany(...). I opted for the latter (see the sketch after this list) because:

  1. The former would require an order_by(...) clause, and there's no guarantee that one is defined on the query.
  2. Re-executing the query n times with a different OFFSET per execution is likely inefficient, and it doesn't guarantee correct pagination given that the filter condition could change between batches.
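A minimal sketch of the fetchmany-based approach, assuming a SQLAlchemy session, an ORM query, and a batch_size; this mirrors the discussion above, not the verbatim diff:

    from typing import Any, Iterator

    def paginated_update_sketch(session, query, batch_size: int = 1000) -> Iterator[Any]:
        result = session.execute(query)  # the SQL statement runs exactly once
        while True:
            rows = result.fetchmany(batch_size)  # paginate the result set, not the query
            if not rows:
                break
            for row in rows:
                yield row[0]  # assumes the mapped entity sits at index 0 of each row
            session.commit()  # committing a batch can no longer perturb later batches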

Comment on the removed slicing loop in the same hunk:

    end = min(start + batch_size, count)
    for obj in query[start:end]:
        yield obj
        session.merge(obj)
john-bodley (Member, author):

There's no need to merge the record. The caller should handle this if required.
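For example, a hypothetical caller then looks like this (the model, filter, and viz_type values are illustrative only; a session and the paginated_update helper are assumed to be in scope):

    from superset.models.slice import Slice

    for chart in paginated_update(session.query(Slice).filter(Slice.viz_type == "dist_bar")):
        chart.viz_type = "echarts_timeseries_bar"  # mutate the yielded ORM object in place
        # No session.merge() or session.commit() here: paginated_update
        # commits each batch on the caller's behalf.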

@@ -66,7 +66,6 @@ def upgrade():
     state["anchor"] = state["hash"]
     del state["hash"]
     entry.value = pickle.dumps(value)
-    session.commit()
john-bodley (Member, author):

There's no need to commit as the paginated_update handles it.

@@ -87,5 +86,3 @@ def downgrade():
     state["hash"] = state["anchor"]
     del state["anchor"]
     entry.value = pickle.dumps(value)
-    session.merge(entry)
-    session.commit()
john-bodley (Member, author):

There's no need to commit as the paginated_update handles it.

Comment on the removed session.merge(entry) line in the same downgrade() hunk:

     entry.value = pickle.dumps(value)
-    session.merge(entry)
john-bodley (Member, author):

There's no need to merge the existing entry. See here for details:

    Used when you may have more than 1 in-memory objects which map to the same database record with some key.
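For context, a small sketch of the case merge() is meant for, per the quoted SQLAlchemy docs, versus this migration's case (the KeyValueEntry model name is hypothetical):

    # merge() reconciles a detached, duplicate in-memory object with the row
    # the session already tracks -- the case the quoted docs describe.
    detached = KeyValueEntry(id=42, value=b"...")  # constructed outside the session
    persistent = session.merge(detached)           # copies state onto the tracked instance

    # By contrast, `entry` here was loaded through the session's own query, so
    # it is already persistent; plain attribute mutation is tracked by the unit
    # of work and flushed on commit, with no merge() needed.
    entry.value = pickle.dumps(value)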

@john-bodley john-bodley force-pushed the john-bodley--fix-paginated_update branch from 1cd04bd to f8f44a5 on October 12, 2022 02:37
@john-bodley john-bodley marked this pull request as ready for review October 12, 2022 02:40
@john-bodley john-bodley requested a review from a team as a code owner October 12, 2022 02:40

codecov bot commented Oct 12, 2022

Codecov Report

Merging #21778 (f8f44a5) into master (bd3166b) will decrease coverage by 0.00%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master   #21778      +/-   ##
==========================================
- Coverage   66.88%   66.87%   -0.01%     
==========================================
  Files        1802     1802              
  Lines       68987    68988       +1     
  Branches     7345     7345              
==========================================
  Hits        46139    46139              
- Misses      20951    20952       +1     
  Partials     1897     1897              
Flag Coverage Δ
hive 52.92% <0.00%> (-0.01%) ⬇️
mysql 78.25% <0.00%> (-0.01%) ⬇️
postgres 78.31% <0.00%> (-0.01%) ⬇️
presto 52.82% <0.00%> (-0.01%) ⬇️
python 81.45% <0.00%> (-0.01%) ⬇️
sqlite 76.80% <0.00%> (-0.01%) ⬇️
unit 51.05% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
superset/migrations/shared/utils.py 35.48% <0.00%> (-0.59%) ⬇️


@ktmud ktmud (Member) left a comment:

😱 Thanks for fixing this nasty bug! It must have taken a lot of time to debug.

Your comment reminded me: what if an object is updated in the loop in a way that makes it no longer match the query's filter condition? Using the original offset after the previous batch was committed would mean some rows are inevitably skipped, which might be a bigger problem. This paginated update should really be used with caution. Maybe before each batch we should check whether the total count of the filtering query has changed.

@john-bodley john-bodley (Member, author) commented Oct 12, 2022

@ktmud:

    what if an object is updated in the loop in a way that makes it no longer match the query's filter condition

Now that the query only executes once, this isn't a problem: the result set is paginated rather than the query being sliced, which ensures the logic is indeed safe.

Note that by default, records from MySQL et al. may happen to be ordered consistently. Your comment actually highlights the core issue, where the migration mutates the record, resulting in a different result set per iteration. This regression likely only impacted the visualization migrations, which are optional.

@john-bodley john-bodley merged commit 88a89c9 into apache:master Oct 13, 2022
@mistercrunch mistercrunch added the 🏷️ bot and 🚢 2.1.0 labels and removed the 🚢 2.1.3 label on Mar 13, 2024
Labels: 🏷️ bot, size/M, 🚢 2.1.0