Stopped reading mysql-bin-changelog without any notifications #275

Closed
palexvs opened this issue Oct 18, 2016 · 11 comments

palexvs commented Oct 18, 2016

gh-ost: 1.0.21
MySQL 5.7.11 (AWS RDS: master<->slave)
Operation mode: connect to replica, migrate on master

I tried to migrate a big table (>200M records), and after 28 hours of migration I got a message that everything was ready for the cut-over:

2016-10-18 20:43:16 DEBUG Getting nothing in the write queue. Sleeping...
Copy: 237640472/237640472 100.0%; Applied: 586055; Backlog: 0/100; Time: 28h35m0s(total), 27h32m0s(copy); streamer: mysql-bin-changelog.001124:14196316; State: postponing cut-over; ETA: due
2016-10-18 20:43:17 DEBUG Getting nothing in the write queue. Sleeping...

I compared the original table with the copy and found that they are not identical: >100K records do not exist in the copy.

The log does not show any errors, but it shows that gh-ost had been reading mysql-bin-changelog from the same file (001124) at the same position (14196316) for the last 21 hours, so none of the updates from that period were copied:

Copy: 69330000/230650433 30.1%; Applied: 585401; Backlog: 4/100; Time: 6h57m0s(total), 6h57m0s(copy); streamer: mysql-bin-changelog.001124:4394961; State: migrating; ETA: 16h10m17s
Copy: 69416000/230650433 30.1%; Applied: 585964; Backlog: 3/100; Time: 6h57m30s(total), 6h57m30s(copy); streamer: mysql-bin-changelog.001124:12757012; State: migrating; ETA: 16h9m44s
Copy: 69526000/230650433 30.1%; Applied: 586055; Backlog: 0/100; Time: 6h58m0s(total), 6h58m0s(copy); streamer: mysql-bin-changelog.001124:14196316; State: migrating; ETA: 16h8m42s
....
Copy: 237640472/237640472 100.0%; Applied: 586055; Backlog: 0/100; Time: 28h40m30s(total), 27h32m0s(copy); streamer: mysql-bin-changelog.001124:14196316; State: postponing cut-over; ETA: due

Command:

./gh-ost \
  --tungsten \
  --assume-master-host='sql01.xxxx.amazonaws.com' \
  --assume-rbr \
  --chunk-size=1000 \
  --nice-ratio=1 \
  --throttle-control-replicas='sql02.xxxx.amazonaws.com' \
  --max-lag-millis=2000 \
  --host='sql02.xxxx.amazonaws.com' \
  --user='root' \
  --password='***' \
  --database='production' \
  --table='t1' \
  --verbose \
  --alter="ADD COLUMN c10 tinyint unsigned" \
  --cut-over=default \
  --allow-master-master \
  --default-retries=1200 \
  --panic-flag-file=/tmp/ghost.panic.flag \
  --postpone-cut-over-flag-file=/tmp/ghost.postpone.flag \
  --throttle-additional-flag-file=/tmp/ghost.throttle.flag \
  --initially-drop-ghost-table \
  --debug \
  --stack \
  --execute > gh.log 2>&1
shlomi-noach (Contributor) commented

The log does not show any errors, but it shows that gh-ost had been reading mysql-bin-changelog from the same file (001124) at the same position (14196316) for the last 21 hours, so none of the updates from that period were copied:

So this was a tungsten cluster, correct?

To complete the story, I assume binary logs did in fact advance on the replica, but gh-ost did not pursue those changes?

Since I'm not a tungsten user I have no immediate insight on this.

Referencing #258 (comment) -- likely unrelated, because gh-ost's reported position does not advance.

I wonder if you could run a --test-on-replica; this would typically run faster. If this reproduces, we may have a better chance of finding the problem.


palexvs commented Oct 21, 2016

So this was a tungsten cluster, correct?

Yes. But I got the same issue twice during a migration in 'Connect to master' mode:

./gh-ost \
  --assume-rbr \
  --chunk-size=1000 \
  --nice-ratio=1 \
  --throttle-control-replicas='sql02.xxxx.amazonaws.com' \
  --max-lag-millis=2000 \
  --host='sql02.xxxx.amazonaws.com' \
  --user='root' \
  --password='***' \
  --database='production' \
  --table='t1' \
  --verbose \
  --alter="ADD COLUMN c10 tinyint unsigned" \
  --cut-over=default \
  --allow-master-master \
  --allow-on-master \
  --default-retries=120 \
  --panic-flag-file=/tmp/ghost.panic.flag \
  --postpone-cut-over-flag-file=/tmp/ghost.postpone.flag \
  --throttle-additional-flag-file=/tmp/ghost.throttle.flag \
  --initially-drop-ghost-table \
  --debug \
  --stack \
  --execute > gh.log 2>&1

To complete the story, I assume binary logs did in fact advance on the replica, but gh-ost did not pursue those changes?

Yes


palexvs commented Oct 21, 2016

I found something that is probably useful in mysql-error-running.log:

2016-10-21 09:10:05 513831 [Note] Aborted connection 513831 to db: 'information_schema' user: 'root' host: '10.0.2.104' (Got timeout reading communication packets)
2016-10-21 09:10:05 513833 [Note] Aborted connection 513833 to db: 'production' user: 'root' host: '10.0.2.104' (Got timeout reading communication packets)
2016-10-21 09:54:44 513832 [Note] Aborted connection 513832 to db: 'unconnected' user: 'root' host: '10.0.2.104' (Failed on my_net_write())

10.0.2.104 is the server from which I run './gh-ost ...'.
I see the same values for streamer: and Applied: in gh.log after '2016-10-21 09:54:06'.

P.S.: Here are some relevant MySQL settings:

wait_timeout: 28800
interactive_timeout: 28800
max_allowed_packet: 4,194,304
net_read_timeout: 30
connect_timeout: 10
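
For context on why this can fail silently: if the server (or anything in between) drops an idle connection without the client ever receiving a FIN/RST, a read with no client-side deadline simply blocks forever instead of returning an error. A minimal illustrative Go sketch, not gh-ost code, using a hypothetical host:

package main

import (
    "fmt"
    "net"
    "time"
)

func main() {
    // sql02.example.com is a hypothetical host standing in for the MySQL replica.
    conn, err := net.Dial("tcp", "sql02.example.com:3306")
    if err != nil {
        fmt.Println("dial failed:", err)
        return
    }
    defer conn.Close()

    buf := make([]byte, 4096)

    // Without a deadline, this read blocks indefinitely if the peer vanishes
    // without the client ever seeing the connection close:
    //   n, err := conn.Read(buf)

    // With a deadline, a stalled stream surfaces as an error instead:
    conn.SetReadDeadline(time.Now().Add(90 * time.Second))
    n, err := conn.Read(buf)
    if err != nil {
        fmt.Println("read failed (timeout or closed connection):", err)
        return
    }
    fmt.Println("read", n, "bytes")
}

The same idea applies to the binlog streaming connection: a bounded read turns a dead stream into a visible error rather than a silent stall.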

shlomi-noach self-assigned this Oct 22, 2016
shlomi-noach (Contributor) commented

@palexvs would you mind trying out #286? This fetches the latest go-mysql library, which hopefully solves your issue.


palexvs commented Oct 27, 2016

I tried twice and the results are the same:
the migration keeps going, but streamer: XXX and Applied: XXX stop changing at some point.

During the last migration the streamer stopped updating between 2016-10-27 10:31:17 and 2016-10-27 10:31:47.
In the MySQL error log I see:

2016-10-27T10:32:38.352336Z 552846 [Note] Aborted connection 552846 to db: 'unconnected' user: 'root' host: '10.0.2.104' (Failed on my_net_write())

At the same moment, monitoring shows the CPU load increasing from ~1-5% up to 60%.

Unfortunately, I don't know which of these is a result of the issue and which is its cause.

Logs:

Copy: 148278000/234709801 63.2%; Applied: 988960; Backlog: 0/100; Time: 18h20m0s(total), 18h19m52s(copy); streamer: mysql-bin-changelog.122163:36577241; State: migrating; ETA: 10h41m6s
Copy: 148342500/234709801 63.2%; Applied: 989148; Backlog: 0/100; Time: 18h20m30s(total), 18h20m22s(copy); streamer: mysql-bin-changelog.122163:43091131; State: migrating; ETA: 10h40m39s
Copy: 148411000/234709801 63.2%; Applied: 989329; Backlog: 0/100; Time: 18h21m0s(total), 18h20m52s(copy); streamer: mysql-bin-changelog.122163:50117305; State: migrating; ETA: 10h40m8s
Copy: 148477000/234709801 63.3%; Applied: 989532; Backlog: 0/100; Time: 18h21m30s(total), 18h21m22s(copy); streamer: mysql-bin-changelog.122163:56898919; State: migrating; ETA: 10h39m39s
Copy: 148540000/234709801 63.3%; Applied: 989754; Backlog: 0/100; Time: 18h22m0s(total), 18h21m52s(copy); streamer: mysql-bin-changelog.122163:63604291; State: migrating; ETA: 10h39m12s
2016/10/27 10:30:01 binlogsyncer.go:552: [info] rotate to (mysql-bin-changelog.122164, 4)
2016/10/27 10:30:01 binlogsyncer.go:552: [info] rotate to (mysql-bin-changelog.122164, 4)
2016-10-27 10:30:01 INFO rotate to next log name: mysql-bin-changelog.122164
2016-10-27 10:30:01 INFO rotate to next log name: mysql-bin-changelog.122164
Copy: 148604500/234709801 63.3%; Applied: 989966; Backlog: 0/100; Time: 18h22m30s(total), 18h22m22s(copy); streamer: mysql-bin-changelog.122164:3739440; State: migrating; ETA: 10h38m44s
[MySQL] 2016/10/27 10:30:47 statement.go:27: Invalid Connection
Copy: 148664500/234709801 63.3%; Applied: 990216; Backlog: 5/100; Time: 18h23m0s(total), 18h22m52s(copy); streamer: mysql-bin-changelog.122164:9903516; State: migrating; ETA: 10h38m19s
Copy: 148729500/234709801 63.4%; Applied: 990424; Backlog: 0/100; Time: 18h23m30s(total), 18h23m22s(copy); streamer: mysql-bin-changelog.122164:14533577; State: migrating; ETA: 10h37m51s
Copy: 148799000/234709801 63.4%; Applied: 990424; Backlog: 0/100; Time: 18h24m0s(total), 18h23m52s(copy); streamer: mysql-bin-changelog.122164:14533577; State: migrating; ETA: 10h37m19s
Copy: 148869000/234709801 63.4%; Applied: 990424; Backlog: 0/100; Time: 18h24m30s(total), 18h24m22s(copy); streamer: mysql-bin-changelog.122164:14533577; State: migrating; ETA: 10h36m47s
Copy: 148937000/234709801 63.5%; Applied: 990424; Backlog: 0/100; Time: 18h25m0s(total), 18h24m52s(copy); streamer: mysql-bin-changelog.122164:14533577; State: migrating; ETA: 10h36m17s
Copy: 149003500/234709801 63.5%; Applied: 990424; Backlog: 0/100; Time: 18h25m30s(total), 18h25m22s(copy); streamer: mysql-bin-changelog.122164:14533577; State: migrating; ETA: 10h35m48s
[MySQL] 2016/10/27 10:33:17 statement.go:27: Invalid Connection
Copy: 149069000/234709801 63.5%; Applied: 990424; Backlog: 0/100; Time: 18h26m0s(total), 18h25m52s(copy); streamer: mysql-bin-changelog.122164:14533577; State: migrating; ETA: 10h35m19s


palexvs commented Feb 27, 2017

@shlomi-noach, #286 alone has not resolved my issue, but #286 plus https://github.com/siddontang/go-mysql (master) has resolved it - I have smoothly migrated a few large tables (>100M records).

I really love your tool and appreciate your work. Thanks!

shlomi-noach (Contributor) commented

@palexvs thank you so much for the information.

FYI, a couple of days ago I merged #375, which merges an up-to-date go-mysql -- I expect this to work for you in the exact same way as your fix -- can you confirm?


palexvs commented Feb 27, 2017

Cool. I am going to start a migration today and will write about the results in ~4 days.


palexvs commented Mar 21, 2017

I tried v1.0.35 and did not hit this issue, so I'm closing it.

P.S.: v1.0.35 lost 1M records out of 400M, but that's another story.

palexvs closed this as completed Mar 21, 2017
shlomi-noach (Contributor) commented

P.S.: v1.0.35 lost 1M records out of 400M, but that's another story.

WAT 😱


hhyo commented Jun 5, 2022

Try setting ReadTimeout (time.Duration) in the BinlogSyncerConfig; the config shown here does not set it:

binlogSyncerConfig := replication.BinlogSyncerConfig{
    ServerID:   serverId,
    Flavor:     "mysql",
    Host:       binlogReader.connectionConfig.Key.Hostname,
    Port:       uint16(binlogReader.connectionConfig.Key.Port),
    User:       binlogReader.connectionConfig.User,
    Password:   binlogReader.connectionConfig.Password,
    TLSConfig:  binlogReader.connectionConfig.TLSConfig(),
    UseDecimal: true,
}
binlogReader.binlogSyncer = replication.NewBinlogSyncer(binlogSyncerConfig)

#179
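
For reference, a minimal self-contained sketch of that suggestion, assuming the siddontang/go-mysql replication package referenced earlier in this thread; the server id, host, credentials, and durations are illustrative rather than values taken from this issue:

package main

import (
    "time"

    "github.com/siddontang/go-mysql/replication"
)

func main() {
    cfg := replication.BinlogSyncerConfig{
        ServerID:   100, // must be unique among replicas; illustrative value
        Flavor:     "mysql",
        Host:       "sql02.example.com", // hypothetical host
        Port:       3306,
        User:       "root",
        Password:   "***",
        UseDecimal: true,

        // Bound how long a binlog read may block, so a silently dropped
        // connection surfaces as an error instead of an indefinite hang.
        ReadTimeout: 90 * time.Second,

        // Ask the master to send periodic heartbeat events, so ReadTimeout
        // is not hit on a healthy-but-idle stream.
        HeartbeatPeriod: 30 * time.Second,
    }
    syncer := replication.NewBinlogSyncer(cfg)
    defer syncer.Close()
}

Pairing ReadTimeout with HeartbeatPeriod keeps a healthy-but-idle stream from tripping the timeout, while a dead connection now fails fast instead of hanging.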
