
perf(alembic): paginize db migration for new dataset models #19406

Closed

Conversation

ktmud
Member

@ktmud ktmud commented Mar 29, 2022

SUMMARY

We ran into scalability issues with the db migration script for SIP-68 (PR #17543). For context, we have more than 165k datasets, 1.9 million columns, and 345k metrics. Loading all of them into memory and converting them to the new tables in one giant commit, as the current implementation does, is impractical: it would kill the Python process, if not the db connection.

This PR tries to optimize this migration script by:

  1. Adding pagination: instead of migrating all datasets & columns in one commit, I added manual pagination to fetch datasets 100 at a time. SQLA streaming with yield_per may also be an option, but it's more complicated when reads and writes are interleaved with each other---SQLA will complain that fetching the next batch is impossible because entities have changed.
  2. Committing changes in small batches: use autocommit_block() to move the write operations for each dataset out of the per-migration transaction so new entities are written to the database faster. This avoids keeping all loaded and created entities in memory, either on the Python side or in the db buffer. I'm placing the autocommit block around each dataset instead of each page---i.e., writing the dataset and all related columns for each SqlaTable in one transaction, instead of writing all datasets and all columns from one page of SqlaTables in one transaction---as this tested to be the fastest configuration. I also tested different page sizes, and 20 seemed to work best: 2,000 datasets can be converted in under 3 minutes, with average memory usage below 350MB while writing is in progress. (See the sketch after this list.)
  3. Using eager loading: added lazy="selectin" to enable SELECT IN eager loading, which pulls related data in one SQL statement instead of three.
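
A minimal sketch of the pagination plus small-batch commit pattern described above. SqlaTable stands for the model mapped inside the migration script and convert_dataset is a hypothetical helper; autocommit_block() is Alembic's MigrationContext API. This is an illustration, not the exact migration code:

    from alembic import op
    from sqlalchemy.orm import Session

    PAGE_SIZE = 20  # 20 tested fastest per the summary above

    def upgrade():
        session = Session(bind=op.get_bind())
        total = session.query(SqlaTable).count()  # SqlaTable: mapped in this script
        for offset in range(0, total, PAGE_SIZE):
            # relationships on SqlaTable use lazy="selectin", so each page's
            # columns and metrics arrive in one extra SELECT ... IN (...) each
            page = (
                session.query(SqlaTable)
                .order_by(SqlaTable.id)
                .offset(offset)
                .limit(PAGE_SIZE)
                .all()
            )
            for table in page:
                # one short transaction per dataset, outside the
                # long-running per-migration transaction
                with op.get_context().autocommit_block():
                    session.add(convert_dataset(table))  # hypothetical helper
                    session.commit()
            session.expunge_all()  # drop the page from the identity map to cap memory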

After this optimization, the migration for our 165k datasets took about 7 hours to finish. Still slow, but better than not being able to finish at all. Ideally, in the future, large full-table migrations like this should be written in raw SQL as much as possible, for each dialect we support.
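
As a rough illustration of that raw-SQL direction, a single step could collapse into one INSERT ... SELECT per dialect (a hedged sketch; the source table and column names below are assumptions, not the actual schema):

    # Hypothetical raw-SQL flavor of one conversion step, executed
    # through the migration connection; names are assumptions.
    op.execute(
        """
        INSERT INTO sl_datasets (sqlatable_id, name, is_physical)
        SELECT id, table_name, (sql IS NULL OR sql = '')
        FROM tables
        """
    )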

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

(screenshot omitted)

TESTING INSTRUCTIONS

  • Tested locally with our large internal database.
  • Added the following filters to the dataset query in the migration (see the sketch after this list) to test the db records generated for specific data tables:
              .filter(
                  or_(
                      SqlaTable.table_name == "my_table_name",
                      SqlaTable.sql.like("%my_table_name%"),
                  )
              )
  • Verified the values of the created entities in a MySQL shell via SELECT queries such as:
     SELECT `schema`, `name`, `is_physical` from sl_datasets limit 10;
     SELECT count(*) from sl_columns;
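
For context, here is roughly how such a filter narrows the dataset query in the migration (a sketch; session and SqlaTable as in the migration script):

    # Sketch: restricting the migration to specific tables while testing.
    from sqlalchemy import or_

    tables = (
        session.query(SqlaTable)
        .filter(
            or_(
                SqlaTable.table_name == "my_table_name",
                SqlaTable.sql.like("%my_table_name%"),
            )
        )
        .all()
    )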

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@ktmud ktmud requested a review from a team as a code owner March 29, 2022 09:34
@ktmud ktmud force-pushed the new-dataset-model-db-migration-optimzation branch 2 times, most recently from bc69072 to 2196255 on March 29, 2022 09:47
is_physical_table
and (column.expression is None or column.expression == "")
),
type=column.type or "Unknown",
Member Author

A-Z (keyword arguments sorted alphabetically)

is_spatial=False,
is_temporal=False,
type="Unknown", # figuring this out would require a type inferrer
warning_text=metric.warning_text,
Member Author

Ditto A-Z

session = inspect(target).session
session: Session = inspect(target).session
database_id = target.database_id
is_physical_table = not target.sql
Member Author

BaseDatasource checks whether a table is physical or virtual by checking whether table.sql is falsy. I'm changing every `target.sql is None` check in this script to `not target.sql` to keep it consistent with the current behavior.
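
A two-line illustration of the behavioral difference:

    # target.sql is None -> physical under both checks
    # target.sql == ""   -> virtual under `is None`, physical under `not target.sql`
    is_physical_table = not target.sql  # matches BaseDatasource's falsy check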

predicate = or_(
*[
and_(
NewTable.database_id == database_id,
Member Author

Add database_id enforcement, since the three together (db + schema + table name) form a unique key.
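
To make the shape concrete, the complete predicate presumably pairs all three columns per table, along these lines (a sketch; the tables variable and attribute names are assumptions based on the fragment above):

    # Hypothetical completion of the predicate fragment above.
    predicate = or_(
        *[
            and_(
                NewTable.database_id == database_id,
                NewTable.schema == table.schema,
                NewTable.name == table.table_name,
            )
            for table in tables
        ]
    )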

Member

We also need to update superset/connectors/sqla/models.py, where the original logic lives (it had to be copied here so that this migration would still work in the future).

batch_op.create_unique_constraint("uq_sl_datasets_uuid", ["uuid"])
batch_op.create_unique_constraint(
"uq_sl_datasets_sqlatable_id", ["sqlatable_id"]
)
Member Author

Move constraints to the end to improve readability.

Contributor

@serenajiang serenajiang left a comment

👏

@github-actions
Contributor

⚠️ @ktmud Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

@ktmud ktmud force-pushed the new-dataset-model-db-migration-optimzation branch from 2196255 to 3a6cee0 on March 29, 2022 17:45
@ktmud ktmud force-pushed the new-dataset-model-db-migration-optimzation branch from 3a6cee0 to bd18390 on March 29, 2022 17:48
@@ -373,65 +366,35 @@ def upgrade():
# ExtraJSONMixin
sa.Column("extra_json", sa.Text(), nullable=True),
# ImportExportMixin
sa.Column("uuid", UUIDType(binary=True), primary_key=False, default=uuid4),
sa.Column(
"uuid", UUIDType(binary=True), primary_key=False, default=uuid4, unique=True
Member Author

I'm moving unique key and foreign key constraints inline.

@github-actions
Contributor

⚠️ @ktmud Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

1 similar comment from @github-actions

@ktmud
Member Author

ktmud commented Mar 31, 2022

Closing in favor of #19421.

@villebro villebro added lts-v1 and removed lts-v1 labels Apr 4, 2022
@ktmud ktmud closed this Apr 4, 2022
@john-bodley john-bodley deleted the new-dataset-model-db-migration-optimzation branch June 8, 2022 05:34