Use gh-ost to add time_spent column to user_levels on production #36110
…ls id column as a starting point.
…t operate a replica anymore.
Will noted there is specific guidance for using gh-ost on Aurora.
--assume-rbr

# amount of rows to handle in each iteration (allowed range: 100-100,000)
--chunk-size=1000
Without knowing much about this, it seems like this chunk size should be larger.
I believe this is the number of rows it will copy at a time from the real user_levels table to the new table where it has applied the schema change. We can adjust it at runtime, after we've started the migration, if progress is poor.
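For reference, here is a minimal sketch of how the chunk size could be changed mid-migration through gh-ost's interactive Unix socket. The socket path is a hypothetical placeholder (the real one depends on --serve-socket-file, or on the default gh-ost derives from the schema and table names), the helper function names are made up for illustration, and `nc -U` assumes a netcat build with Unix socket support.

```shell
# ASSUMPTION: hypothetical socket path; use whatever the running gh-ost reports at startup.
GH_OST_SOCKET="${GH_OST_SOCKET:-/tmp/gh-ost.example_db.user_levels.sock}"

# Print the interactive command gh-ost expects, e.g. "chunk-size=2500".
# Rejects values outside the 100-100,000 range noted in the script comment.
chunk_size_command() {
  size="$1"
  case "$size" in
    ''|*[!0-9]*|???????*)
      echo "chunk size must be an integer between 100 and 100000" >&2
      return 1
      ;;
  esac
  if [ "$size" -lt 100 ] || [ "$size" -gt 100000 ]; then
    echo "chunk size must be an integer between 100 and 100000" >&2
    return 1
  fi
  printf 'chunk-size=%s\n' "$size"
}

# Validate the value, then send the command to the running migration.
set_chunk_size() {
  cmd="$(chunk_size_command "$1")" || return 1
  printf '%s\n' "$cmd" | nc -U "$GH_OST_SOCKET"
}
```

With a migration running, `set_chunk_size 2500` would raise the chunk size, and `echo status | nc -U "$GH_OST_SOCKET"` should print current progress so the effect can be checked.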
--panic-flag-file=/tmp/gh-ost.panic.flag

# actually execute the alter & migrate the table. Default is noop: do some tests and exit
# --execute
Guessing this comment is temporary?
Yup! gh-ost can be executed in a no-op mode, so we’ll run that first to make sure paths and config are correct, then manually un-comment, and run the utility for real.
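One possible alternative to un-commenting by hand, sketched under assumptions: gate --execute behind an environment variable so the noop run and the real run use identical flags. `EXECUTE`, `GH_OST_BIN`, and `run_gh_ost` are hypothetical names, not part of this PR, and only the flags quoted in this review are shown; the connection and table flags from the real script would still need to be added.

```shell
# Run gh-ost in noop mode by default; append --execute only when EXECUTE=1.
run_gh_ost() {
  # Flags taken from the script under review (connection/table flags omitted here).
  set -- \
    --assume-rbr \
    --chunk-size=1000 \
    --panic-flag-file=/tmp/gh-ost.panic.flag \
    --verbose \
    --alter="ADD time_spent int"

  # ASSUMPTION: EXECUTE is a hypothetical opt-in switch for the real migration.
  if [ "${EXECUTE:-0}" = "1" ]; then
    set -- "$@" --execute
  fi

  # GH_OST_BIN lets the binary be swapped out (for example with echo) to preview the command.
  "${GH_OST_BIN:-gh-ost}" "$@"
}
```

Calling `GH_OST_BIN=echo run_gh_ost` prints the flags without touching the database, `run_gh_ost` performs the noop check, and `EXECUTE=1 run_gh_ost` runs the migration for real.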
Everything looks good to me. I'd prefer other folks sign off, however.
--verbose

# alter statement (mandatory)
--alter="ADD time_spent int"
Looks good!
It's not clear whether it would be safe to use gh-ost again in the future for this type of migration, particularly during the US school year, given the impact this migration had on database performance.
Write latency on the database increased from ~0.4ms to ~13ms while the gh-ost migration was running. This was possibly due to the gh-ost utility reading the binary logs (which is known to degrade Aurora write performance) along with the gh-ost utility carrying out bulk INSERTs into the new table.
CPU utilization on the Aurora WRITER instance increased to ~35% during the migration.
InnoDB row lock waits and row lock time also increased significantly during the migration.
These two changes occurred at distinct times, correct? Can we correlate the performance impact with one of these specific times, to determine whether one of them, or both, caused it?
While we enabled binary logging the day before executing the gh-ost migration, there weren't any replication clients reading from the binary log until we started executing the gh-ost migration, which also started writing intensively to the new table at the same time. I can see that when the copy of the …
In retrospect, I should have upgraded the Aurora engine to 2.08.0 first. I had dismissed the binary logging performance improvements in that release because they only apply to large transactions and our application mostly executes small, single row, single statement transactions. I forgot that gh-ost is likely creating large transactions as it inserts data in large chunks into the new table.
I upgraded a clone of the production cluster to Aurora Engine 2.08.2, which includes a performance improvement for binary logging when there are large transactions. gh-ost commits somewhat large transactions as it copies data over in chunks with large insert statements (this Pull Request uses a chunk size of 1000 rows), so it's possible that the new engine version would help.
I tried running the same gh-ost migration on the upgraded clone and it exhibited the same degradation in performance (write latency increased from ~0.4ms to ~13ms). It appears that our options to use gh-ost to carry out schema changes on large tables will continue to be limited to time periods when usage is low, such as during winter break or summer vacation.
Add the time_spent column to the production database user_levels table using gh-ost. The column is being added via standard ALTER TABLE DDL SQL on all other environments via a Rails migration (#36082). With >2B rows and a very high rate of concurrent reads/writes to the table, ALTER TABLE would either fail (due to consuming space in the alter log) or negatively impact performance of the cluster.
We no longer operate a replica, so we are using Option B identified in this guide.
Will noted there is guidance on using gh-ost with Aurora.
Deployment strategy
production-console (TODO: install via Chef).
Testing story
Reviewer Checklist: