FR: if --exact-rowcount=true, don't wait for count to return before starting #194

ggunson · 2016-08-23T22:39:02Z

SELECT COUNT(*) FROM table can take a long time with large tables, and is slower on MySQL 5.7 than 5.6, according to this bug report: https://bugs.mysql.com/bug.php?id=80580

Using the --exact-rowcount flag, gh-ost gets a count of the table before it starts copying rows, so that the progress and estimated completion times are more accurate than using table statistics for the calculations.

If the flag is used on a big migration, the migration can be delayed several minutes (we saw upwards of 40 minutes in practice using a 5.7 replica) waiting for the count, and the row count is now 40 minutes old and is no longer accurate on an insert-heavy table.

Is it possible to start the row copy prior to getting the count and check on the result later as it would with the various panic/postpone flag files? It can take a while for gh-ost to display an estimated completion time already, even after it has the count.

/cc @github/database-infrastructure

The text was updated successfully, but these errors were encountered:

shlomi-noach · 2016-08-24T06:20:01Z

If the flag is used on a big migration, the migration can be delayed several minutes (we saw upwards of 40 minutes in practice using a 5.7 replica) waiting for the count, and the row count is now 40 minutes old and is no longer accurate on an insert-heavy table.

That is true, and the number is never really absolutely correct, but experience shows that it is nonetheless very close to the real value; on our largest migrations I've see it off by less than 1%. So I suggest in terms of accuracy we've achieved a good one.

I think the big deal here is "are we willing to pay an extra 30 minutes for the count(*) to complete before migration starts".

Running the count(*) in parallel to running the row-copy and updating the estimate later is entirely possible.

This will have the:

positive effect of reducing runtime and getting some stats early on the migration
confusing effect of changing ETA some 30 minutes after migration started

We can make this optional and the user can choose desired behavior.

shlomi-noach · 2016-08-24T06:41:23Z

Waiting the first 30 minutes of a migration by just waiting for count(*) to complete is also something that's not human friendly -- another reason for allowing count(*) to run concurrently with migration.

dveeden · 2016-08-24T07:10:05Z

On an insert heavy table running the count(*) for 40 minutes might also grow the undo logs in ibdata1. Maybe changing the transaction isolation is useful.

Does it adjust the count when gh-ost sees INSERTs or DELETEs? If so, are these two (count and adjustments) properly aligned? Maybe it should start a transaction (and enforce InnoDB to take part in it) and then get the binlog position and then start the count and the watching-of-binlogs?

ggunson · 2016-08-24T07:10:19Z

the number is never really absolutely correct, but experience shows that it is nonetheless very close to the real value

True; my idea is more about saving that delay at the beginning.

@jonahberquist noted, though, that if the row copying and binlog parsing began at the same time as the select count(*), and the # of rows inserted or deleted (as parsed from the binlog) were added to/subtracted from the total count in the gh-ost status updates, then there would be some ETA and status accuracy improvements there. Even if that's not important or particularly impactful, it's still 🎈 💥 🍰

shlomi-noach · 2016-08-24T10:11:56Z

On an insert heavy table running the count(*) for 40 minutes might also grow the undo logs in ibdata1. Maybe changing the transaction isolation is useful.

@dveeden good idea, thank you

Does it adjust the count when gh-ost sees INSERTs or DELETEs? If so, are these two (count and adjustments) properly aligned? Maybe it should start a transaction (and enforce InnoDB to take part in it) and then get the binlog position and then start the count and the watching-of-binlogs?

Yes, it adjusts the count on insert and delete. Right now the heuristic is simplistic. +1 for every insert and -1 for every delete. There's plan to make that actually take the copied-range into consideration at least with auto_increment primary key migrations.

However these are pretty accurate as they are; I wouldn't invest much effort in those counters.

shlomi-noach · 2016-08-24T10:13:41Z

@jonahberquist noted, though, that if the row copying and binlog parsing began at the same time as the select count(*), and the # of rows inserted or deleted (as parsed from the binlog) were added to/subtracted from the total count in the gh-ost status updates, then there would be some ETA and status accuracy improvements there. Even if that's not important or particularly impactful, it's still

@ggunson that's true.

Work begins at #196

shlomi-noach · 2016-08-24T10:27:55Z

On an insert heavy table running the count(*) for 40 minutes might also grow the undo logs in ibdata1. Maybe changing the transaction isolation is useful.

Noting down that this applies whether count(*) is concurrent or not concurrent.

ggunson · 2016-08-24T20:39:45Z

On an insert heavy table running the count(*) for 40 minutes might also grow the undo logs in ibdata1. Maybe changing the transaction isolation is useful.

I wasn't very scientific in my testing but changing the session isolation level between repeatable read, read committed and read uncommitted didn't make a difference (or much of one) in terms of undo log growth on a 5.7 replica. Which makes sense to me in the context of a read-only select in autocommit mode.

If the select is run on its own (not inside a greater transaction with writes), I don't know if setting the session transaction level to read-committed would be helpful for non-locking, either. But please correct me if I'm wrong about this.

shlomi-noach · 2016-08-25T07:58:17Z

This is what it looks like, in production:

Copy: 0/14923516 0.0%; Applied: 0; Backlog: 0/100; Time: 0s(total), 0s(copy); streamer: mysql-bin.008614:61454014; ETA: N/A
Copy: 0/14923516 0.0%; Applied: 0; Backlog: 63/100; Time: 1s(total), 1s(copy); streamer: mysql-bin.008614:66051562; ETA: N/A
Copy: 15800/14923517 0.1%; Applied: 121; Backlog: 0/100; Time: 2s(total), 2s(copy); streamer: mysql-bin.008614:71840230; ETA: N/A
Copy: 34000/14923517 0.2%; Applied: 164; Backlog: 1/100; Time: 3s(total), 3s(copy); streamer: mysql-bin.008614:76176964; ETA: N/A
Copy: 52500/14923517 0.4%; Applied: 200; Backlog: 0/100; Time: 4s(total), 4s(copy); streamer: mysql-bin.008614:81580997; ETA: N/A
[MySQL] 2016/08/25 00:55:35 statement.go:27: Invalid Connection
Copy: 70400/14923517 0.5%; Applied: 257; Backlog: 0/100; Time: 5s(total), 5s(copy); streamer: mysql-bin.008614:83558818; ETA: N/A
Copy: 87600/14923517 0.6%; Applied: 337; Backlog: 0/100; Time: 6s(total), 6s(copy); streamer: mysql-bin.008614:89712263; ETA: N/A
Copy: 104700/14923517 0.7%; Applied: 382; Backlog: 0/100; Time: 7s(total), 7s(copy); streamer: mysql-bin.008614:96050224; ETA: N/A
Copy: 122000/14923517 0.8%; Applied: 430; Backlog: 2/100; Time: 8s(total), 8s(copy); streamer: mysql-bin.008614:101298746; ETA: N/A
2016-08-25 00:55:38 INFO Exact number of rows via COUNT: 17272102
Copy: 136500/17272103 0.8%; Applied: 477; Backlog: 0/100; Time: 9s(total), 9s(copy); streamer: mysql-bin.008614:106043626; ETA: N/A
[MySQL] 2016/08/25 00:55:40 statement.go:27: Invalid Connection
Copy: 154400/17272103 0.9%; Applied: 528; Backlog: 0/100; Time: 10s(total), 10s(copy); streamer: mysql-bin.008614:110861095; ETA: N/A
Copy: 173100/17272103 1.0%; Applied: 550; Backlog: 0/100; Time: 11s(total), 11s(copy); streamer: mysql-bin.008614:112705946; ETA: 18m6s
Copy: 190000/17272103 1.1%; Applied: 621; Backlog: 0/100; Time: 12s(total), 12s(copy); streamer: mysql-bin.008614:119328963; ETA: 17m58s
Copy: 206400/17272103 1.2%; Applied: 683; Backlog: 0/100; Time: 13s(total), 13s(copy); streamer: mysql-bin.008614:125131188; ETA: 17m54s

Notice how 2016-08-25 00:55:38 INFO Exact number of rows via COUNT: 17272102 kicks in and the estimated number of rows changes.

shlomi-noach · 2016-08-26T06:33:42Z

Closed by #196

shlomi-noach · 2016-08-26T06:38:31Z

released in https://github.com/github/gh-ost/releases/tag/v1.0.14

shlomi-noach added the enhancement label Aug 24, 2016

shlomi-noach self-assigned this Aug 24, 2016

shlomi-noach mentioned this issue Aug 24, 2016

concurrent row-count #196

Merged

4 tasks

shlomi-noach closed this as completed Aug 26, 2016

ggunson mentioned this issue Oct 21, 2016

Make -concurrent-rowcount default to true #285

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FR: if --exact-rowcount=true, don't wait for count to return before starting #194

FR: if --exact-rowcount=true, don't wait for count to return before starting #194

ggunson commented Aug 23, 2016

shlomi-noach commented Aug 24, 2016

shlomi-noach commented Aug 24, 2016

dveeden commented Aug 24, 2016

ggunson commented Aug 24, 2016 •

edited

shlomi-noach commented Aug 24, 2016

shlomi-noach commented Aug 24, 2016

shlomi-noach commented Aug 24, 2016

ggunson commented Aug 24, 2016

shlomi-noach commented Aug 25, 2016 •

edited

shlomi-noach commented Aug 26, 2016

shlomi-noach commented Aug 26, 2016

FR: if --exact-rowcount=true, don't wait for count to return before starting #194

FR: if --exact-rowcount=true, don't wait for count to return before starting #194

Comments

ggunson commented Aug 23, 2016

shlomi-noach commented Aug 24, 2016

shlomi-noach commented Aug 24, 2016

dveeden commented Aug 24, 2016

ggunson commented Aug 24, 2016 • edited

shlomi-noach commented Aug 24, 2016

shlomi-noach commented Aug 24, 2016

shlomi-noach commented Aug 24, 2016

ggunson commented Aug 24, 2016

shlomi-noach commented Aug 25, 2016 • edited

shlomi-noach commented Aug 26, 2016

shlomi-noach commented Aug 26, 2016

ggunson commented Aug 24, 2016 •

edited

shlomi-noach commented Aug 25, 2016 •

edited