Stopped reading mysql-bin-changelog without any notifications #275

Closed
palexvs opened this issue Oct 18, 2016 · 11 comments

palexvs commented Oct 18, 2016

gh-ost: 1.0.21
MySQL 5.7.11 (AWS RDS: master<->slave)
Operation mode: connect to replica, migrate on master

I tried to migrate a big table (>200M records), and after 28 hours of migration I got a message that everything was ready for the cut-over:

2016-10-18 20:43:16 DEBUG Getting nothing in the write queue. Sleeping...
Copy: 237640472/237640472 100.0%; Applied: 586055; Backlog: 0/100; Time: 28h35m0s(total), 27h32m0s(copy); streamer: mysql-bin-changelog.001124:14196316; State: postponing cut-over; ETA: due
2016-10-18 20:43:17 DEBUG Getting nothing in the write queue. Sleeping...

I compared the original table with the copy and found that they are not identical: >100K records do not exist in the copy.

The log does not show any errors, but it shows that gh-ost had been reading mysql-bin-changelog from the same file (001124) at the same position (14196316) for the last 21 hours, so none of the updates from that period were copied:

Copy: 69330000/230650433 30.1%; Applied: 585401; Backlog: 4/100; Time: 6h57m0s(total), 6h57m0s(copy); streamer: mysql-bin-changelog.001124:4394961; State: migrating; ETA: 16h10m17s
Copy: 69416000/230650433 30.1%; Applied: 585964; Backlog: 3/100; Time: 6h57m30s(total), 6h57m30s(copy); streamer: mysql-bin-changelog.001124:12757012; State: migrating; ETA: 16h9m44s
Copy: 69526000/230650433 30.1%; Applied: 586055; Backlog: 0/100; Time: 6h58m0s(total), 6h58m0s(copy); streamer: mysql-bin-changelog.001124:14196316; State: migrating; ETA: 16h8m42s
....
Copy: 237640472/237640472 100.0%; Applied: 586055; Backlog: 0/100; Time: 28h40m30s(total), 27h32m0s(copy); streamer: mysql-bin-changelog.001124:14196316; State: postponing cut-over; ETA: due

Command:

./gh-ost \
  --tungsten \
  --assume-master-host='sql01.xxxx.amazonaws.com' \
  --assume-rbr \
  --chunk-size=1000 \
  --nice-ratio=1 \
  --throttle-control-replicas='sql02.xxxx.amazonaws.com' \
  --max-lag-millis=2000 \
  --host='sql02.xxxx.amazonaws.com' \
  --user='root' \
  --password='***' \
  --database='production' \
  --table='t1' \
  --verbose \
  --alter="ADD COLUMN c10 tinyint unsigned" \
  --cut-over=default \
  --allow-master-master \
  --default-retries=1200 \
  --panic-flag-file=/tmp/ghost.panic.flag \
  --postpone-cut-over-flag-file=/tmp/ghost.postpone.flag \
  --throttle-additional-flag-file=/tmp/ghost.throttle.flag \
  --initially-drop-ghost-table \
  --debug \
  --stack \
  --execute > gh.log 2>&1
shlomi-noach (Contributor) commented

The log does not show any errors, but it shows that gh-ost had been reading mysql-bin-changelog from the same file (001124) at the same position (14196316) for the last 21 hours, so none of the updates from that period were copied:

So this was a tungsten cluster, correct?

To complete the story, I assume binary logs did in fact advance on the replica, but gh-ost did not pursue those changes?

Since I'm not a tungsten user I have no immediate insight on this.

Referencing #258 (comment) -- likely unrelated, because gh-ost's reported position does not advance.

I wonder if you could run a --test-on-replica; this would typically run faster. If this reproduces, we may have a better chance of finding the problem.


palexvs commented Oct 21, 2016

So this was a tungsten cluster, correct?

Yes. But I got the same issue twice during a migration in 'Connect to master' mode:

./gh-ost \
  --assume-rbr \
  --chunk-size=1000 \
  --nice-ratio=1 \
  --throttle-control-replicas='sql02.xxxx.amazonaws.com' \
  --max-lag-millis=2000 \
  --host='sql02.xxxx.amazonaws.com' \
  --user='root' \
  --password='***' \
  --database='production' \
  --table='t1' \
  --verbose \
  --alter="ADD COLUMN c10 tinyint unsigned" \
  --cut-over=default \
  --allow-master-master \
  --allow-on-master \
  --default-retries=120 \
  --panic-flag-file=/tmp/ghost.panic.flag \
  --postpone-cut-over-flag-file=/tmp/ghost.postpone.flag \
  --throttle-additional-flag-file=/tmp/ghost.throttle.flag \
  --initially-drop-ghost-table \
  --debug \
  --stack \
  --execute > gh.log 2>&1

To complete the story, I assume binary logs did in fact advance on the replica, but gh-ost did not pursue those changes?

Yes


palexvs commented Oct 21, 2016

I found something that is probably useful in mysql-error-running.log:

2016-10-21 09:10:05 513831 [Note] Aborted connection 513831 to db: 'information_schema' user: 'root' host: '10.0.2.104' (Got timeout reading communication packets)
2016-10-21 09:10:05 513833 [Note] Aborted connection 513833 to db: 'production' user: 'root' host: '10.0.2.104' (Got timeout reading communication packets)
2016-10-21 09:54:44 513832 [Note] Aborted connection 513832 to db: 'unconnected' user: 'root' host: '10.0.2.104' (Failed on my_net_write())

10.0.2.104 is the server from which I run './gh-ost ...'.
I see the same values for streamer: and Applied: in gh.log after '2016-10-21 09:54:06'.

P.S.: Here are some relevant MySQL settings:

wait_timeout: 28800
interactive_timeout: 28800
max_allowed_packet: 4,194,304
net_read_timeout: 30
connect_timeout: 10
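
For context on why this can fail silently: if the server (or anything in between) drops an idle connection without the client ever receiving a FIN/RST, a read with no client-side deadline simply blocks forever instead of returning an error. A minimal illustrative Go sketch, not gh-ost code, using a hypothetical host:

package main

import (
    "fmt"
    "net"
    "time"
)

func main() {
    // sql02.example.com is a hypothetical host standing in for the MySQL replica.
    conn, err := net.Dial("tcp", "sql02.example.com:3306")
    if err != nil {
        fmt.Println("dial failed:", err)
        return
    }
    defer conn.Close()

    buf := make([]byte, 4096)

    // Without a deadline, this read blocks indefinitely if the peer vanishes
    // without the client ever seeing the connection close:
    //   n, err := conn.Read(buf)

    // With a deadline, a stalled stream surfaces as an error instead:
    conn.SetReadDeadline(time.Now().Add(90 * time.Second))
    n, err := conn.Read(buf)
    if err != nil {
        fmt.Println("read failed (timeout or closed connection):", err)
        return
    }
    fmt.Println("read", n, "bytes")
}

The same idea applies to the binlog streaming connection: a bounded read turns a dead stream into a visible error rather than a silent stall.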

shlomi-noach self-assigned this Oct 22, 2016
shlomi-noach (Contributor) commented

@palexvs would you mind trying out #286? This fetches the latest go-mysql library, which hopefully solves your issue.


palexvs commented Oct 27, 2016

I tried twice and the results are the same:
the migration keeps going, but streamer: XXX and Applied: XXX stop changing at some point.

During the last migration the streamer stopped updating between 2016-10-27 10:31:17 and 2016-10-27 10:31:47.
In the MySQL error log I see:

2016-10-27T10:32:38.352336Z 552846 [Note] Aborted connection 552846 to db: 'unconnected' user: 'root' host: '10.0.2.104' (Failed on my_net_write())

At the same moment, monitoring shows the CPU load increasing from ~1-5% up to 60%.

Unfortunately, I don't know which of these is a result of the issue and which is its cause.

Logs:

Copy: 148278000/234709801 63.2%; Applied: 988960; Backlog: 0/100; Time: 18h20m0s(total), 18h19m52s(copy); streamer: mysql-bin-changelog.122163:36577241; State: migrating; ETA: 10h41m6s
Copy: 148342500/234709801 63.2%; Applied: 989148; Backlog: 0/100; Time: 18h20m30s(total), 18h20m22s(copy); streamer: mysql-bin-changelog.122163:43091131; State: migrating; ETA: 10h40m39s
Copy: 148411000/234709801 63.2%; Applied: 989329; Backlog: 0/100; Time: 18h21m0s(total), 18h20m52s(copy); streamer: mysql-bin-changelog.122163:50117305; State: migrating; ETA: 10h40m8s
Copy: 148477000/234709801 63.3%; Applied: 989532; Backlog: 0/100; Time: 18h21m30s(total), 18h21m22s(copy); streamer: mysql-bin-changelog.122163:56898919; State: migrating; ETA: 10h39m39s
Copy: 148540000/234709801 63.3%; Applied: 989754; Backlog: 0/100; Time: 18h22m0s(total), 18h21m52s(copy); streamer: mysql-bin-changelog.122163:63604291; State: migrating; ETA: 10h39m12s
2016/10/27 10:30:01 binlogsyncer.go:552: [info] rotate to (mysql-bin-changelog.122164, 4)
2016/10/27 10:30:01 binlogsyncer.go:552: [info] rotate to (mysql-bin-changelog.122164, 4)
2016-10-27 10:30:01 INFO rotate to next log name: mysql-bin-changelog.122164
2016-10-27 10:30:01 INFO rotate to next log name: mysql-bin-changelog.122164
Copy: 148604500/234709801 63.3%; Applied: 989966; Backlog: 0/100; Time: 18h22m30s(total), 18h22m22s(copy); streamer: mysql-bin-changelog.122164:3739440; State: migrating; ETA: 10h38m44s
[MySQL] 2016/10/27 10:30:47 statement.go:27: Invalid Connection
Copy: 148664500/234709801 63.3%; Applied: 990216; Backlog: 5/100; Time: 18h23m0s(total), 18h22m52s(copy); streamer: mysql-bin-changelog.122164:9903516; State: migrating; ETA: 10h38m19s
Copy: 148729500/234709801 63.4%; Applied: 990424; Backlog: 0/100; Time: 18h23m30s(total), 18h23m22s(copy); streamer: mysql-bin-changelog.122164:14533577; State: migrating; ETA: 10h37m51s
Copy: 148799000/234709801 63.4%; Applied: 990424; Backlog: 0/100; Time: 18h24m0s(total), 18h23m52s(copy); streamer: mysql-bin-changelog.122164:14533577; State: migrating; ETA: 10h37m19s
Copy: 148869000/234709801 63.4%; Applied: 990424; Backlog: 0/100; Time: 18h24m30s(total), 18h24m22s(copy); streamer: mysql-bin-changelog.122164:14533577; State: migrating; ETA: 10h36m47s
Copy: 148937000/234709801 63.5%; Applied: 990424; Backlog: 0/100; Time: 18h25m0s(total), 18h24m52s(copy); streamer: mysql-bin-changelog.122164:14533577; State: migrating; ETA: 10h36m17s
Copy: 149003500/234709801 63.5%; Applied: 990424; Backlog: 0/100; Time: 18h25m30s(total), 18h25m22s(copy); streamer: mysql-bin-changelog.122164:14533577; State: migrating; ETA: 10h35m48s
[MySQL] 2016/10/27 10:33:17 statement.go:27: Invalid Connection
Copy: 149069000/234709801 63.5%; Applied: 990424; Backlog: 0/100; Time: 18h26m0s(total), 18h25m52s(copy); streamer: mysql-bin-changelog.122164:14533577; State: migrating; ETA: 10h35m19s


palexvs commented Feb 27, 2017

@shlomi-noach, #286 alone has not resolved my issue, but #286 plus https://github.com/siddontang/go-mysql (master) has resolved it - I have smoothly migrated a few large tables (>100M records).

I really love your tool and appreciate your work. Thanks!

shlomi-noach (Contributor) commented

@palexvs thank you so much for the information.

FYI, a couple of days ago I merged #375, which merges an up-to-date go-mysql -- I expect this to work for you in the exact same way as your fix -- can you confirm?


palexvs commented Feb 27, 2017

Cool. I am going to start a migration today and will write about the results in ~4 days.


palexvs commented Mar 21, 2017

I tried v1.0.35 and did not hit this issue, so I'm closing it.

P.S.: v1.0.35 lost 1M records out of 400M, but that's another story.

palexvs closed this as completed Mar 21, 2017
shlomi-noach (Contributor) commented

P.S.: v1.0.35 lost 1M records out of 400M, but that's another story.

WAT 😱


hhyo commented Jun 5, 2022

Try setting ReadTimeout (time.Duration) in the BinlogSyncerConfig; the config shown here does not set it:

binlogSyncerConfig := replication.BinlogSyncerConfig{
    ServerID:   serverId,
    Flavor:     "mysql",
    Host:       binlogReader.connectionConfig.Key.Hostname,
    Port:       uint16(binlogReader.connectionConfig.Key.Port),
    User:       binlogReader.connectionConfig.User,
    Password:   binlogReader.connectionConfig.Password,
    TLSConfig:  binlogReader.connectionConfig.TLSConfig(),
    UseDecimal: true,
}
binlogReader.binlogSyncer = replication.NewBinlogSyncer(binlogSyncerConfig)

#179
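
For reference, a minimal self-contained sketch of that suggestion, assuming the siddontang/go-mysql replication package referenced earlier in this thread; the server id, host, credentials, and durations are illustrative rather than values taken from this issue:

package main

import (
    "time"

    "github.com/siddontang/go-mysql/replication"
)

func main() {
    cfg := replication.BinlogSyncerConfig{
        ServerID:   100, // must be unique among replicas; illustrative value
        Flavor:     "mysql",
        Host:       "sql02.example.com", // hypothetical host
        Port:       3306,
        User:       "root",
        Password:   "***",
        UseDecimal: true,

        // Bound how long a binlog read may block, so a silently dropped
        // connection surfaces as an error instead of an indefinite hang.
        ReadTimeout: 90 * time.Second,

        // Ask the master to send periodic heartbeat events, so ReadTimeout
        // is not hit on a healthy-but-idle stream.
        HeartbeatPeriod: 30 * time.Second,
    }
    syncer := replication.NewBinlogSyncer(cfg)
    defer syncer.Close()
}

Pairing ReadTimeout with HeartbeatPeriod keeps a healthy-but-idle stream from tripping the timeout, while a dead connection now fails fast instead of hanging.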
