
perf(alembic): paginize db migration for new dataset models #19406

Closed

Conversation

ktmud
Member

@ktmud ktmud commented Mar 29, 2022

SUMMARY

We ran into scalability issues with the db migration script for SIP-68 (PR #17543). For context, we have more than 165k datasets, 1.9 million columns, and 345k metrics. Loading all of them into memory and converting them to the new tables in one giant commit, as the current implementation does, is impractical: it would kill the Python process, if not the db connection.

This PR tries to optimize this migration script by:

  1. Adding pagination: instead of migrating all datasets & columns in one commit, I added manual pagination to fetch datasets 100 at a time. SQLA streaming with yield_per may also be an option, but it's more complicated when reads and writes are interleaved with each other---SQLA will complain that fetching the next batch is impossible because entities have changed.
  2. Committing changes in small batches: use autocommit_block() to move the write operations for each dataset out of the per-migration transaction so new entities are written to the database faster. This avoids keeping all loaded and created entities in memory, either on the Python side or in the db buffer. I'm placing the autocommit block around each dataset instead of each page---i.e., writing the dataset and all related columns for each SqlaTable in one transaction, instead of writing all datasets and all columns from one page of SqlaTables in one transaction---as this tested to be the fastest configuration. I also tested different page sizes, and 20 seemed to work best: 2,000 datasets can be converted in under 3 minutes, with average memory usage below 350MB while writing is in progress. (See the sketch after this list.)
  3. Using eager loading: added lazy="selectin" to enable SELECT IN eager loading, which pulls related data in one SQL statement instead of three.
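
A minimal sketch of the pagination plus small-batch commit pattern described above. SqlaTable stands for the model mapped inside the migration script and convert_dataset is a hypothetical helper; autocommit_block() is Alembic's MigrationContext API. This is an illustration, not the exact migration code:

    from alembic import op
    from sqlalchemy.orm import Session

    PAGE_SIZE = 20  # 20 tested fastest per the summary above

    def upgrade():
        session = Session(bind=op.get_bind())
        total = session.query(SqlaTable).count()  # SqlaTable: mapped in this script
        for offset in range(0, total, PAGE_SIZE):
            # relationships on SqlaTable use lazy="selectin", so each page's
            # columns and metrics arrive in one extra SELECT ... IN (...) each
            page = (
                session.query(SqlaTable)
                .order_by(SqlaTable.id)
                .offset(offset)
                .limit(PAGE_SIZE)
                .all()
            )
            for table in page:
                # one short transaction per dataset, outside the
                # long-running per-migration transaction
                with op.get_context().autocommit_block():
                    session.add(convert_dataset(table))  # hypothetical helper
                    session.commit()
            session.expunge_all()  # drop the page from the identity map to cap memory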

After this optimization, the migration for our 165k datasets took about 7 hours to finish. Still slow, but better than not being able to finish at all. Ideally, in the future, large full-table migrations like this should be written in raw SQL as much as possible, for each dialect we support.
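
As a rough illustration of that raw-SQL direction, a single step could collapse into one INSERT ... SELECT per dialect (a hedged sketch; the source table and column names below are assumptions, not the actual schema):

    # Hypothetical raw-SQL flavor of one conversion step, executed
    # through the migration connection; names are assumptions.
    op.execute(
        """
        INSERT INTO sl_datasets (sqlatable_id, name, is_physical)
        SELECT id, table_name, (sql IS NULL OR sql = '')
        FROM tables
        """
    )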

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

(screenshot omitted)

TESTING INSTRUCTIONS

  • Tested locally with our large internal database.
  • Added the following filters to the dataset query in the migration (see the sketch after this list) to test the db records generated for specific data tables:
              .filter(
                  or_(
                      SqlaTable.table_name == "my_table_name",
                      SqlaTable.sql.like("%my_table_name%"),
                  )
              )
  • Verified the values of the created entities in a MySQL shell via SELECT queries such as:
     SELECT `schema`, `name`, `is_physical` from sl_datasets limit 10;
     SELECT count(*) from sl_columns;
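
For context, here is roughly how such a filter narrows the dataset query in the migration (a sketch; session and SqlaTable as in the migration script):

    # Sketch: restricting the migration to specific tables while testing.
    from sqlalchemy import or_

    tables = (
        session.query(SqlaTable)
        .filter(
            or_(
                SqlaTable.table_name == "my_table_name",
                SqlaTable.sql.like("%my_table_name%"),
            )
        )
        .all()
    )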

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@ktmud ktmud requested a review from a team as a code owner March 29, 2022 09:34
@ktmud ktmud force-pushed the new-dataset-model-db-migration-optimzation branch 2 times, most recently from bc69072 to 2196255 on March 29, 2022 09:47
is_physical_table
and (column.expression is None or column.expression == "")
),
type=column.type or "Unknown",
Member Author

A-Z (keyword arguments sorted alphabetically)

is_spatial=False,
is_temporal=False,
type="Unknown", # figuring this out would require a type inferrer
warning_text=metric.warning_text,
Member Author

Ditto A-Z

session = inspect(target).session
session: Session = inspect(target).session
database_id = target.database_id
is_physical_table = not target.sql
Member Author

BaseDatasource checks whether a table is physical or virtual by checking whether table.sql is falsy. I'm changing every `target.sql is None` check in this script to `not target.sql` to keep it consistent with the current behavior.
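
A two-line illustration of the behavioral difference:

    # target.sql is None -> physical under both checks
    # target.sql == ""   -> virtual under `is None`, physical under `not target.sql`
    is_physical_table = not target.sql  # matches BaseDatasource's falsy check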

predicate = or_(
*[
and_(
NewTable.database_id == database_id,
Member Author

Add database_id enforcement, since the three together (db + schema + table name) form a unique key.
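
To make the shape concrete, the complete predicate presumably pairs all three columns per table, along these lines (a sketch; the tables variable and attribute names are assumptions based on the fragment above):

    # Hypothetical completion of the predicate fragment above.
    predicate = or_(
        *[
            and_(
                NewTable.database_id == database_id,
                NewTable.schema == table.schema,
                NewTable.name == table.table_name,
            )
            for table in tables
        ]
    )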

Member

We also need to update superset/connectors/sqla/models.py, where the original logic lives (it had to be copied here so that this migration would still work in the future).

batch_op.create_unique_constraint("uq_sl_datasets_uuid", ["uuid"])
batch_op.create_unique_constraint(
"uq_sl_datasets_sqlatable_id", ["sqlatable_id"]
)
Member Author

Move constraints to the end to improve readability.

Contributor

@serenajiang serenajiang left a comment

👏

@github-actions
Contributor

⚠️ @ktmud Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

@ktmud ktmud force-pushed the new-dataset-model-db-migration-optimzation branch from 2196255 to 3a6cee0 on March 29, 2022 17:45
@ktmud ktmud force-pushed the new-dataset-model-db-migration-optimzation branch from 3a6cee0 to bd18390 on March 29, 2022 17:48
@@ -373,65 +366,35 @@ def upgrade():
# ExtraJSONMixin
sa.Column("extra_json", sa.Text(), nullable=True),
# ImportExportMixin
sa.Column("uuid", UUIDType(binary=True), primary_key=False, default=uuid4),
sa.Column(
"uuid", UUIDType(binary=True), primary_key=False, default=uuid4, unique=True
Member Author

I'm moving unique key and foreign key constraints inline.

@github-actions
Contributor

⚠️ @ktmud Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

1 similar comment from @github-actions

@ktmud
Member Author

ktmud commented Mar 31, 2022

Closing in favor of #19421.

@villebro villebro added lts-v1 and removed lts-v1 labels Apr 4, 2022
@ktmud ktmud closed this Apr 4, 2022
@john-bodley john-bodley deleted the new-dataset-model-db-migration-optimzation branch June 8, 2022 05:34