Use gh-ost to add time_spent column to user_levels on production #36110

Merged: 3 commits merged into staging on Aug 31, 2020

@sureshc sureshc commented Jul 31, 2020

Add the time_spent column to the production user_levels table using gh-ost. On all other environments the column is being added with a standard ALTER TABLE DDL statement via a Rails migration (#36082). With >2B rows and a very high rate of concurrent reads/writes to the table, ALTER TABLE on production would either fail (by exhausting space in the online alter log) or negatively impact performance of the cluster.

We no longer operate a replica, so we are using Option B identified in this guide.

Will noted there is guidance on using gh-ost with Aurora.

Deployment strategy

  1. Temporarily disable Aurora binlog filtering (#36116).
  2. Verify, and if necessary update, the version of gh-ost deployed to production-console (TODO: install via Chef).
  3. Run gh-ost in a screen session.
  4. Monitor performance from 6AM PDT Monday morning as usage increases.
  5. When creation of the ghost table is complete, cut over during a low-usage period.
  6. When the new table is confirmed to be working correctly, delete the old table during a low-usage period.
  7. Re-enable Aurora binlog filtering.
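
Steps 3 and 5 could be sketched roughly as the wrapper below. It only prints the gh-ost command line so it can be reviewed before being started inside screen. The flags `--assume-rbr`, `--chunk-size`, `--panic-flag-file`, `--verbose`, and `--alter` come from this pull request. Everything else is an assumption for illustration: `--allow-on-master` (assuming Option B means pointing gh-ost directly at the writer), `--postpone-cut-over-flag-file`, the host name, the database name, and the flag file path. Credentials are omitted.

```shell
#!/bin/sh
# Sketch only: prints a gh-ost command line for review; it does not run gh-ost.
# DB_HOST, DB_NAME and POSTPONE_FLAG are hypothetical placeholders.
DB_HOST="${DB_HOST:-writer.example.internal}"
DB_NAME="${DB_NAME:-example_db}"
POSTPONE_FLAG="${POSTPONE_FLAG:-/tmp/gh-ost.postpone.flag}"

# gh_ost_command [execute]
# With no argument the command is a dry run (gh-ost's default no-op mode).
gh_ost_command() {
  cmd="gh-ost --host=$DB_HOST --database=$DB_NAME --table=user_levels"
  cmd="$cmd --allow-on-master --assume-rbr"
  cmd="$cmd --chunk-size=1000"
  cmd="$cmd --panic-flag-file=/tmp/gh-ost.panic.flag"
  # While this file exists, gh-ost keeps syncing the ghost table but does not cut over.
  cmd="$cmd --postpone-cut-over-flag-file=$POSTPONE_FLAG"
  cmd="$cmd --verbose"
  cmd="$cmd --alter=\"ADD time_spent int\""
  if [ "${1:-}" = "execute" ]; then
    cmd="$cmd --execute"
  fi
  printf '%s\n' "$cmd"
}

gh_ost_command            # first pass: dry run
gh_ost_command execute    # real run, to be started inside screen
```

With a postpone flag file in place, removing that file during a low-usage window is what would let the cut-over in step 5 proceed.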

Testing story

Reviewer Checklist:

  • Tests provide adequate coverage
  • Code is well-commented
  • New features are translatable or updates will not break translations
  • Relevant documentation has been added or updated
  • User impact is well-understood and desirable
  • Pull Request is labeled appropriately
  • Follow-up work items (including potential tech debt) are tracked and linked

sureshc commented Jul 31, 2020

Will noted there is specific guidance for using gh-ost on Aurora.

--assume-rbr

# amount of rows to handle in each iteration (allowed range: 100-100,000)
--chunk-size=1000
Contributor

Without knowing much about this, it seems like this chunk size should be larger.

Contributor Author

I believe this is the number of rows gh-ost copies at a time from the real user_levels table to the new table, where the schema change has already been applied. We can adjust it at runtime after starting the migration if progress is poor.
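
A runtime adjustment could look roughly like the sketch below. gh-ost accepts interactive commands such as `chunk-size=<n>` over a Unix socket; the helper validates the value against the 100-100,000 range quoted in the config comment and prints the command to send. The helper name and the socket path are hypothetical; the real path is whatever socket the running gh-ost process listens on.

```shell
#!/bin/sh
# Sketch only: builds the interactive command that changes gh-ost's chunk size
# while a migration is running. SOCKET is a hypothetical path.
SOCKET="${SOCKET:-/tmp/gh-ost.example_db.user_levels.sock}"

# chunk_size_command N
# Prints "chunk-size=N" if N is an integer in the allowed range (100-100,000).
chunk_size_command() {
  case "$1" in
    ''|*[!0-9]*)
      echo "chunk size must be a positive integer" >&2
      return 1
      ;;
  esac
  if [ "$1" -lt 100 ] || [ "$1" -gt 100000 ]; then
    echo "chunk size must be between 100 and 100000" >&2
    return 1
  fi
  printf 'chunk-size=%s\n' "$1"
}

# Usage (not run here): send the command to the running gh-ost process.
#   chunk_size_command 5000 | nc -U "$SOCKET"
chunk_size_command 5000
```

Validating before sending keeps a typo from reaching a migration that is already copying rows on the production writer.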

--panic-flag-file=/tmp/gh-ost.panic.flag

# actually execute the alter & migrate the table. Default is noop: do some tests and exit
# --execute
Contributor

Guessing this comment is temporary?

sureshc (Contributor Author), Jul 31, 2020

Yup! gh-ost can be executed in a no-op mode, so we’ll run that first to make sure paths and config are correct, then manually un-comment, and run the utility for real.
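
Because the switch from dry run to real run is a manual un-comment, a small guard like the one below could report which mode a flags file is in before gh-ost is started. This is only a sketch: it assumes the file holds one flag per line with `#` comments, as in this PR, and the helper name and demo file are hypothetical.

```shell
#!/bin/sh
# Sketch only: reports whether a gh-ost flags file would run for real.

# will_execute FILE -> exit status 0 if FILE has an uncommented --execute line.
will_execute() {
  grep -Eq '^[[:space:]]*--execute[[:space:]]*$' "$1"
}

# Demo on a throwaway file containing the relevant lines.
demo_file="$(mktemp)"
printf '%s\n' '--verbose' '# --execute' > "$demo_file"
if will_execute "$demo_file"; then
  echo "real run: --execute is enabled"
else
  echo "dry run: --execute is commented out"
fi
rm -f "$demo_file"
```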

@jmkulwik
Contributor

Everything looks good to me. I'd prefer other folks sign off however.

--verbose

# alter statement (mandatory)
--alter="ADD time_spent int"
Contributor

Looks good!

@sureshc sureshc marked this pull request as ready for review July 31, 2020 21:48
sureshc commented Aug 4, 2020

It's not clear whether it would be safe to use gh-ost again in the future for this type of migration, particularly during the US school year, given the impact this migration had on database performance.

Write latency on the database increased from ~0.4ms to ~13ms while the gh-ost migration was running. This was possibly due to the gh-ost utility reading the binary logs (which is known to degrade Aurora write performance) along with the gh-ost utility carrying out bulk INSERTs into the new table.

image

CPU utilization on the Aurora WRITER instance increased to ~35% during the migration

image

InnoDB row lock waits and row lock time also increased significantly during the migration.

image

wjordan commented Aug 4, 2020

> This was possibly due to the gh-ost utility reading the binary logs (which is known to degrade Aurora write performance) along with the gh-ost utility carrying out bulk INSERTs into the new table.

These two changes occurred at distinct times, correct? Can we correlate the performance-impact change to one of these specific times to rule out the possibility of either/both of them causing the impact?

sureshc commented Aug 10, 2020

> These two changes occurred at distinct times, correct? Can we correlate the performance-impact change to one of these specific times to rule out the possibility of either/both of them causing the impact?

While we enabled binary logging the day before executing the gh-ost migration, no replication clients were reading from the binary log until the migration started, and gh-ost began writing intensively to the new table at the same time. When the copy of the user_levels table to the new table completed, write latency improved to ~1ms. At that point, gh-ost was still reading from the binary logs but carrying out fewer writes (just the new/updated user_levels rows). Write latency returned to our baseline of ~0.4ms after the gh-ost operation completed.
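
To put the three phases on one scale, the approximate latencies quoted in this thread can be expressed as multiples of the ~0.4ms baseline. This is plain arithmetic on the reported figures, not a new measurement; the helper name is hypothetical.

```shell
#!/bin/sh
# Sketch only: expresses the reported write latencies as multiples of baseline.
latency_ratio() {
  # $1 = observed latency in ms, $2 = baseline latency in ms
  awk -v observed="$1" -v baseline="$2" 'BEGIN { printf "%.1f\n", observed / baseline }'
}

echo "row copy + binlog reads:         $(latency_ratio 13 0.4)x baseline"
echo "binlog reads + incremental rows: $(latency_ratio 1 0.4)x baseline"
```

The gap between roughly 32.5x and 2.5x is consistent with the bulk row copy accounting for most of the increase, although the two effects were never measured in isolation.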

image

In retrospect, I should have upgraded the Aurora engine to 2.08.0 first. I had dismissed the binary logging performance improvements in that release because they only apply to large transactions and our application mostly executes small, single row, single statement transactions. I forgot that gh-ost is likely creating large transactions as it inserts data in large chunks into the new table.

sureshc commented Aug 31, 2020

I upgraded a clone of the production cluster to Aurora Engine 2.08.2, which includes a performance improvement for binary logging when there are large transactions. gh-ost commits somewhat large transactions as it copies data over in chunks with large INSERT statements (this pull request uses a batch size of 1000 rows), so it seemed possible that the new engine version would help. I ran the same gh-ost migration on the upgraded clone and it exhibited the same degradation in performance (write latency increased from ~0.4ms to ~13ms). It appears that our options for using gh-ost to carry out schema changes on large tables will continue to be limited to periods when usage is low, such as winter break or summer vacation.
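
For a sense of scale, the number of row-copy chunks implied by that batch size can be estimated with integer arithmetic. The row count used here is only the ">2B rows" lower bound from the PR description, so the real figure would be somewhat higher; the helper name is hypothetical.

```shell
#!/bin/sh
# Sketch only: estimates how many row-copy chunks gh-ost would issue.
chunk_count() {
  # $1 = total rows, $2 = chunk size; integer ceiling division
  echo $(( ($1 + $2 - 1) / $2 ))
}

chunk_count 2000000000 1000    # prints 2000000
```

At 1000 rows per chunk that is at least two million chunk inserts, each of which also has to pass through the binary log while gh-ost is reading it.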

@sureshc sureshc merged commit b255ce5 into staging Aug 31, 2020
@sureshc sureshc deleted the add_time_spent_to_user_levels_with_gh_ost branch August 31, 2020 18:00