Use gh-ost to add time_spent column to user_levels on production #36110

sureshc · 2020-07-31T18:34:27Z

Add the time_spent column to the production database user_levels table using gh-ost. On all other environments the column is being added with a standard ALTER TABLE DDL statement via a Rails migration (#36082). With >2B rows and a very high rate of concurrent reads and writes to the table, ALTER TABLE on production would either fail (due to consuming space in the alter log) or negatively impact the performance of the cluster.

We no longer operate a replica, so this migration uses Option B identified in this guide.

Will noted there is guidance on using gh-ost with Aurora.

Deployment strategy

Temporarily disable Aurora binlog filtering (#36116).
Verify and possibly update the version of gh-ost deployed to production-console (TODO: install via Chef).
Run gh-ost in a screen session.
Monitor performance at 6AM PDT Monday morning as usage increases.
When creation of the ghost table is complete, cut over during a low-usage time period.
When the new table is confirmed to be working correctly, delete the old table during a low-usage time period.
Re-enable the Aurora binlog filter.
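
One way to hold the cut-over step until a low-usage window is gh-ost's --postpone-cut-over-flag-file option: while the named file exists, gh-ost keeps syncing the ghost table but does not swap tables. The sketch below is only an illustration of that idea. The flag-file path and the helper function names are hypothetical, and nothing in this PR confirms that the migration script passes this option.

```shell
# Hypothetical flag-file path; it would have to match the value passed to
# gh-ost via --postpone-cut-over-flag-file (an assumption, not from this PR).
POSTPONE_FLAG="${POSTPONE_FLAG:-/tmp/gh-ost.postpone.flag}"

# Create the flag before starting gh-ost so the row copy can finish
# without triggering the table swap.
hold_cut_over() {
  touch "$POSTPONE_FLAG"
}

# Remove the flag during a low-usage window to allow the cut-over.
release_cut_over() {
  rm -f "$POSTPONE_FLAG"
}

# Report whether the cut-over is currently being held.
cut_over_state() {
  if [ -e "$POSTPONE_FLAG" ]; then
    echo "postponed"
  else
    echo "released"
  fi
}
```

With this in place, hold_cut_over would be run before the migration starts and release_cut_over once usage has dropped.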

Testing story

Reviewer Checklist:

Tests provide adequate coverage
Code is well-commented
New features are translatable or updates will not break translations
Relevant documentation has been added or updated
User impact is well-understood and desirable
Pull Request is labeled appropriately
Follow-up work items (including potential tech debt) are tracked and linked

sureshc · 2020-07-31T18:48:50Z

Will noted there is specific guidance for using gh-ost on Aurora.

jmkulwik · 2020-07-31T20:11:12Z

bin/oneoff/gh-ost_migrations/add_time_spent_column_to_user_levels.sh

+  --assume-rbr
+
+  # amount of rows to handle in each iteration (allowed range: 100-100,000)
+  --chunk-size=1000


Without knowing much about this, it seems like this chunk size should be larger.

I believe this is the number of rows gh-ost copies at a time from the real user_levels table to the new table, where it has applied the schema change. If progress is poor, we can adjust it at runtime after we’ve started the migration.
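
A minimal sketch of the runtime adjustment mentioned above, assuming gh-ost's interactive commands are reachable over its unix socket. The socket path and the function names are hypothetical; the real path depends on how gh-ost was started (for example via --serve-socket-file). The range check mirrors the 100-100,000 range quoted in the script's comment.

```shell
# Hypothetical socket path; adjust to wherever the running gh-ost serves it.
GH_OST_SOCKET="${GH_OST_SOCKET:-/tmp/gh-ost.sock}"

# Print the interactive command for a new chunk size, rejecting values
# outside the 100-100,000 range quoted in the migration script.
chunk_size_command() {
  local size="$1"
  case "$size" in
    ''|*[!0-9]*)
      echo "chunk size must be a positive integer" >&2
      return 1
      ;;
  esac
  if [ "$size" -lt 100 ] || [ "$size" -gt 100000 ]; then
    echo "chunk size must be between 100 and 100000" >&2
    return 1
  fi
  echo "chunk-size=$size"
}

# Send the command to a running gh-ost over its unix socket.
set_chunk_size() {
  local cmd
  cmd="$(chunk_size_command "$1")" || return 1
  echo "$cmd" | nc -U "$GH_OST_SOCKET"
}
```

For example, set_chunk_size 5000 would ask a running migration to copy 5000 rows per iteration, provided the socket path is correct.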

jmkulwik · 2020-07-31T20:12:25Z

bin/oneoff/gh-ost_migrations/add_time_spent_column_to_user_levels.sh

+  --panic-flag-file=/tmp/gh-ost.panic.flag
+
+  # actually execute the alter & migrate the table. Default is noop: do some tests and exit
+#  --execute


Guessing this comment is temporary?

Yup! gh-ost can be executed in a no-op mode, so we’ll run that first to make sure paths and config are correct, then manually uncomment the flag and run the utility for real.
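
An alternative to commenting and uncommenting the flag by hand is to toggle it from the environment. This is only a sketch: the EXECUTE variable and the gh_ost_args function are hypothetical, the --table value is assumed from the table name in this PR, and the connection and tuning flags from the real script are omitted.

```shell
# Build the gh-ost argument list, one argument per line.
# Without EXECUTE=1 the list has no --execute, so gh-ost would stay in
# its default no-op mode (run its checks and exit).
gh_ost_args() {
  set -- --table=user_levels --alter="ADD time_spent int" --verbose
  if [ "${EXECUTE:-0}" = "1" ]; then
    set -- "$@" --execute
  fi
  printf '%s\n' "$@"
}

# Print the arguments for review instead of running gh-ost directly.
gh_ost_args
```

Running the block as written lists the no-op arguments; EXECUTE=1 gh_ost_args appends --execute as the last argument.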

jmkulwik · 2020-07-31T20:13:34Z

Everything looks good to me. I'd prefer that other folks sign off as well, however.

jmkulwik · 2020-07-31T21:08:42Z

bin/oneoff/gh-ost_migrations/add_time_spent_column_to_user_levels.sh

+  --verbose
+
+  # alter statement (mandatory)
+  --alter="ADD time_spent int"


Looks good!

sureshc · 2020-08-04T23:01:11Z

It's not clear whether it would be safe to use gh-ost again in the future for this type of migration, particularly during the US school year, given the impact this migration had on database performance.

Write latency on the database increased from ~0.4ms to ~13ms while the gh-ost migration was running. This was possibly due to the gh-ost utility reading the binary logs (which is known to degrade Aurora write performance) along with the gh-ost utility carrying out bulk INSERTs into the new table.

CPU utilization on the Aurora WRITER instance increased to ~35% during the migration.

InnoDB row lock waits and row lock time also increased significantly during the migration.

wjordan · 2020-08-04T23:15:21Z

This was possibly due to the gh-ost utility reading the binary logs (which is known to degrade Aurora write performance) along with the gh-ost utility carrying out bulk INSERTs into the new table.

These two changes occurred at distinct times, correct? Can we correlate the performance-impact change to one of these specific times to rule out the possibility of either/both of them causing the impact?

sureshc · 2020-08-10T17:45:08Z

These two changes occurred at distinct times, correct? Can we correlate the performance-impact change to one of these specific times to rule out the possibility of either/both of them causing the impact?

While we enabled binary logging the day before executing the gh-ost migration, there weren't any replication clients reading from the binary log until we started the migration, which also began writing intensively to the new table at the same time. When the copy of the user_levels table to the new table completed, write latency improved to ~1ms. At that point, gh-ost was still reading from the binary logs but carrying out fewer writes (just the new/updated user_levels rows). Write latency returned to our baseline of ~0.4ms after the gh-ost operation completed.

In retrospect, I should have upgraded the Aurora engine to 2.08.0 first. I had dismissed the binary logging performance improvements in that release because they only apply to large transactions, and our application mostly executes small, single-row, single-statement transactions. I forgot that gh-ost likely creates large transactions as it inserts data in large chunks into the new table.

sureshc · 2020-08-31T18:00:12Z

I upgraded a clone of the production cluster to Aurora Engine 2.08.2, which includes a performance improvement for binary logging when there are large transactions. gh-ost commits somewhat large transactions as it copies data over in chunks with large INSERT statements (this Pull Request uses a chunk size of 1000 rows), so it seemed possible that the new engine version would help. I ran the same gh-ost migration on the upgraded clone and it exhibited the same degradation in performance (write latency increased from ~0.4ms to ~13ms). It appears that our options for using gh-ost to carry out schema changes on large tables will continue to be limited to periods when usage is low, such as winter break or summer vacation.
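
For a sense of scale, the number of chunk copies follows directly from the row count and the chunk size. The sketch below uses the 2B lower bound from the PR description, so the results are lower bounds too; the chunks_needed helper is hypothetical.

```shell
# Number of chunk copies needed for a table of a given size.
# Uses ceiling division so a final partial chunk is counted.
chunks_needed() {
  local rows="$1" chunk_size="$2"
  echo $(( (rows + chunk_size - 1) / chunk_size ))
}

chunks_needed 2000000000 1000     # prints 2000000
chunks_needed 2000000000 100000   # prints 20000
```

At the chunk size used here, the row copy takes at least two million INSERT batches. The largest allowed chunk size would cut that to twenty thousand much larger batches.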

sureshc added 2 commits July 30, 2020 15:40

Use a copy of the script that converted the datatype of the user_levels id column as a starting point. (4801572)

Configure migration to run directly on primary cluster, since we don't operate a replica anymore. (5d6ab2e)

sureshc requested review from wjordan, jmkulwik and bencodeorg July 31, 2020 18:34

jmkulwik reviewed Jul 31, 2020

sureshc marked this pull request as ready for review July 31, 2020 21:48

wjordan approved these changes Jul 31, 2020

sureshc mentioned this pull request Jul 31, 2020

Temporarily disable Aurora binlog filtering for gh-ost #36116

Merged

7 tasks

jmkulwik mentioned this pull request Jul 31, 2020

Pass time spent from StudioApp to user_level #36121

Merged

7 tasks

Make it executable. (ad94d83)

sureshc mentioned this pull request Aug 2, 2020

disable Aurora binary logging" #36124

Merged

This was referenced Aug 19, 2020

Time spent to progress redux #36344

Closed

Pass recorded time_spent to the teacher section progress redux store #36351

Merged

sureshc merged commit b255ce5 into staging Aug 31, 2020

sureshc deleted the add_time_spent_to_user_levels_with_gh_ost branch August 31, 2020 18:00
