Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-ost stuck throttling on "value out of range" error #827

Closed
grantholly opened this issue Mar 4, 2020 · 6 comments
Closed

gh-ost stuck throttling on "value out of range" error #827

grantholly opened this issue Mar 4, 2020 · 6 comments
Assignees

Comments

@grantholly
Copy link

grantholly commented Mar 4, 2020

ghost version:

./gh-ost --version
1.0.49

ghost command:

./gh-ost \
  --debug \
  --stack \
  --max-load=Threads_running=100 \
  --user="gh-ost" \
  --password=mypass \
  --host=mydb \
  --port=3306 \
  --database="dirac" \
  --table="remote_archives" \
  --verbose \
  --alter="add hidden tinyint(2) default null" \
  --initially-drop-ghost-table \
  --initially-drop-old-table \
  --timestamp-old-table \
  --assume-rbr \
  --chunk-size=1000 \
  --throttle-additional-flag-file=/tmp/gh-ost-test-wozniak.throttle.flag \
  --serve-socket-file=/tmp/gh-ost-test-wozniak.test.sock \
  --panic-flag-file=/tmp/gh-ost-test-wozniak.panic.flag \
  --postpone-cut-over-flag-file=/tmp/ghost-test-wozniak.postpone.flag \
  --allow-on-master \
  --execute

mysql version:

NOTE: this is an aurora MySQL instance

mysql> show global variables like 'version%';
+-------------------------+------------------------------+
| Variable_name           | Value                        |
+-------------------------+------------------------------+
| version                 | 5.7.12-log                   |
| version_comment         | MySQL Community Server (GPL) |
| version_compile_machine | x86_64                       |
| version_compile_os      | Linux                        |
+-------------------------+------------------------------+
4 rows in set (0.07 sec)

table definition:

mysql> show create table remote_archives\G
*************************** 1. row ***************************
       Table: remote_archives
Create Table: CREATE TABLE `remote_archives` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `aaa` int(11) NOT NULL,
  `bbb` varchar(255) DEFAULT NULL,
  `ccc` varchar(16) DEFAULT NULL,
  `ddd` datetime NOT NULL,
  ... some other stuff
  PRIMARY KEY (`id`),
  UNIQUE KEY `id` (`id`),
  ... some other stuff here too
  ENGINE=InnoDB AUTO_INCREMENT=4023 DEFAULT CHARSET=utf8
1 row in set (0.08 sec)

problem:

This gh-ost run will forever be throttled and never copy any rows. Each reported line from the output contains that same error message, however, the reported value changes slightly:

State: throttled, Threads_running sql: Scan error on column index 1, name "Value": converting driver.Value type []uint8 ("18446744073709547932") to a int64: value out of range;

Interestingly, this worked when I excluded the --execute flag, and a percona blog post mentioned using gh-ost on aurora MySQL (https://www.percona.com/blog/2018/06/07/using-gh-ost-with-amazon-aurora-for-mysql/) so I was initially very optimistic.

The error message seems to be mentioning a column named "Value", but my table does not have that column. However, I did find the value column on the _remote_archives_ghc table.
Also of note, this database is not receiving any writes during this migration, and the table only has ~3300 rows in it. Not a big deal right? With the database being so small, I went looking in the remote_archives table to find any values that equaled the reported 18446744073709547932 and could not find it in any of the rows on the table

The gh-ost process doesn't ever crash, but it also never makes any progress copying rows. It looks like this error message is being passed through gh-ost. I couldn't find anything in the code base for "value out of range" for example. It also looks like some sort of parsing problem. Perhaps #681 and #157 could be related. I'm just guessing though.

I should also mention that this same migration worked flawlessly on MySQL 5.7 in our datacenter as well as with our MySQL Vitess cluster.

Right now I am totally blocked on making this migration.

error output with --debug --stack --verbose

2020-03-04 11:56:21 INFO starting gh-ost 1.0.49
2020-03-04 11:56:21 INFO Migrating `dirac`.`remote_archives`
2020-03-04 11:56:21 INFO connection validated on test-boring-wozniak-dirac:3306
2020-03-04 11:56:22 INFO User has REPLICATION CLIENT, REPLICATION SLAVE privileges, and has ALL privileges on `dirac`.*
2020-03-04 11:56:22 INFO binary logs validated on test-boring-wozniak-dirac:3306
2020-03-04 11:56:22 INFO Inspector initiated on ip-172-31-1-98:3306, version 5.7.12-log
2020-03-04 11:56:22 INFO Table found. Engine=InnoDB
2020-03-04 11:56:22 DEBUG Estimated number of rows via STATUS: 3307
2020-03-04 11:56:22 DEBUG Validated no foreign keys exist on table
2020-03-04 11:56:22 DEBUG Validated no triggers exist on table
2020-03-04 11:56:22 INFO Estimated number of rows via EXPLAIN: 3307
2020-03-04 11:56:22 DEBUG Potential unique keys in remote_archives: [PRIMARY (auto_increment): [id]; has nullable: false id (auto_increment): [id]; has nullable: false remote_archives_on_s3_key: [s3_key]; has nullable: false]
2020-03-04 11:56:22 INFO Recursively searching for replication master
2020-03-04 11:56:22 DEBUG Looking for master on test-boring-wozniak-dirac:3306
2020-03-04 11:56:23 INFO Master found to be ip-172-31-1-98:3306
2020-03-04 11:56:23 INFO log_slave_updates validated on test-boring-wozniak-dirac:3306
2020-03-04 11:56:23 INFO connection validated on test-boring-wozniak-dirac:3306
2020-03-04 11:56:23 DEBUG Streamer binlog coordinates: mysql-bin-changelog.000005:847732
2020-03-04 11:56:23 INFO Connecting binlog streamer at mysql-bin-changelog.000005:847732
[2020/03/04 11:56:23] [info] binlogsyncer.go:133 create BinlogSyncer with config {99999 mysql test-boring-wozniak-dirac 3306 gh-ost    false false <nil> false UTC true 0 0s 0s 0 false}
[2020/03/04 11:56:23] [info] binlogsyncer.go:354 begin to sync binlog from position (mysql-bin-changelog.000005, 847732)
[2020/03/04 11:56:23] [info] binlogsyncer.go:203 register slave for master server test-boring-wozniak-dirac:3306
2020-03-04 11:56:24 DEBUG Beginning streaming
2020-03-04 11:56:24 INFO rotate to next log from mysql-bin-changelog.000005:0 to mysql-bin-changelog.000005
[2020/03/04 11:56:24] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.000005, 847732)
2020-03-04 11:56:24 INFO connection validated on test-boring-wozniak-dirac:3306
2020-03-04 11:56:24 INFO connection validated on test-boring-wozniak-dirac:3306
2020-03-04 11:56:25 INFO will use time_zone='SYSTEM' on applier
2020-03-04 11:56:25 INFO Examining table structure on applier
2020-03-04 11:56:25 INFO Applier initiated on ip-172-31-1-98:3306, version 5.7.12-log
2020-03-04 11:56:25 INFO Dropping table `dirac`.`_remote_archives_gho`
2020-03-04 11:56:25 INFO Table dropped
2020-03-04 11:56:25 INFO Dropping table `dirac`.`_remote_archives_20200304115621_del`
2020-03-04 11:56:25 INFO Table dropped
2020-03-04 11:56:25 INFO Dropping table `dirac`.`_remote_archives_ghc`
2020-03-04 11:56:25 INFO Table dropped
2020-03-04 11:56:25 INFO Creating changelog table `dirac`.`_remote_archives_ghc`
2020-03-04 11:56:25 INFO Changelog table created
2020-03-04 11:56:25 INFO Creating ghost table `dirac`.`_remote_archives_gho`
2020-03-04 11:56:25 INFO Ghost table created
2020-03-04 11:56:25 INFO Altering ghost table `dirac`.`_remote_archives_gho`
2020-03-04 11:56:25 DEBUG ALTER statement: alter /* gh-ost */ table `dirac`.`_remote_archives_gho` add encrypted tinyint(2) default null
2020-03-04 11:56:25 INFO Ghost table altered
2020-03-04 11:56:25 INFO Intercepted changelog state GhostTableMigrated
2020-03-04 11:56:26 INFO Created postpone-cut-over-flag-file: /tmp/ghost-test-wozniak.postpone.flag
2020-03-04 11:56:26 INFO Waiting for ghost table to be migrated. Current lag is 0s
2020-03-04 11:56:26 INFO Handled changelog state GhostTableMigrated
2020-03-04 11:56:26 DEBUG ghost table migrated
2020-03-04 11:56:26 DEBUG Potential unique keys in _remote_archives_gho: [PRIMARY (auto_increment): [id]; has nullable: false id (auto_increment): [id]; has nullable: false remote_archives_on_s3_key: [s3_key]; has nullable: false]
2020-03-04 11:56:26 INFO Chosen shared unique key is PRIMARY
2020-03-04 11:56:26 INFO Shared columns are id,account_id,event_type,retention_policy,created_at,updated_at,purged_at,begin_time,s3_key,etag,insert_worker,query_worker,size,version,end_time,corrupt,event_count
2020-03-04 11:56:26 INFO Listening on unix socket file: /tmp/gh-ost-test-wozniak.test.sock
2020-03-04 11:56:26 DEBUG Reading migration range according to key: PRIMARY
2020-03-04 11:56:27 INFO Migration min values: [79]
2020-03-04 11:56:27 DEBUG Reading migration range according to key: PRIMARY
2020-03-04 11:56:27 INFO Migration max values: [4017]
2020-03-04 11:56:27 INFO Waiting for first throttle metrics to be collected
2020-03-04 11:56:27 INFO First throttle metrics collected
2020-03-04 11:56:27 DEBUG Getting nothing in the write queue. Sleeping...
# Migrating `dirac`.`remote_archives`; Ghost table is `dirac`.`_remote_archives_gho`
2020-03-04 11:56:27 DEBUG Operating until row copy is complete
# Migrating ip-172-31-1-98:3306; inspecting ip-172-31-1-98:3306; executing on C02XF1WAJG5M
# Migration started at Wed Mar 04 11:56:21 -0800 2020
# chunk-size: 1000; max-lag-millis: 1500ms; dml-batch-size: 10; max-load: Threads_running=100; critical-load: ; nice-ratio: 0.000000
# throttle-additional-flag-file: /tmp/gh-ost-test-wozniak.throttle.flag
# postpone-cut-over-flag-file: /tmp/ghost-test-wozniak.postpone.flag [set]
# panic-flag-file: /tmp/gh-ost-test-wozniak.panic.flag
# Serving on unix socket: /tmp/gh-ost-test-wozniak.test.sock
Copy: 0/3307 0.0%; Applied: 0; Backlog: 0/1000; Time: 6s(total), 0s(copy); streamer: mysql-bin-changelog.000005:854312; Lag: 0.09s, State: migrating; ETA: N/A
Copy: 0/3307 0.0%; Applied: 0; Backlog: 0/1000; Time: 7s(total), 1s(copy); streamer: mysql-bin-changelog.000005:860140; Lag: 0.09s, State: throttled, Threads_running sql: Scan error on column index 1, name "Value": converting driver.Value type []uint8 ("18446744073709547932") to a int64: value out of range; ETA: N/A
Copy: 0/3307 0.0%; Applied: 0; Backlog: 0/1000; Time: 8s(total), 2s(copy); streamer: mysql-bin-changelog.000005:865912; Lag: 0.10s, State: throttled, Threads_running sql: Scan error on column index 1, name "Value": converting driver.Value type []uint8 ("18446744073709547933") to a int64: value out of range; ETA: N/A
Copy: 0/3307 0.0%; Applied: 0; Backlog: 0/1000; Time: 9s(total), 3s(copy); streamer: mysql-bin-changelog.000005:870543; Lag: 0.12s, State: throttled, Threads_running sql: Scan error on column index 1, name "Value": converting driver.Value type []uint8 ("18446744073709547933") to a int64: value out of range; ETA: N/A
Copy: 0/3307 0.0%; Applied: 0; Backlog: 0/1000; Time: 10s(total), 4s(copy); streamer: mysql-bin-changelog.000005:876314; Lag: 0.08s, State: throttled, Threads_running sql: Scan error on column index 1, name "Value": converting driver.Value type []uint8 ("18446744073709547929") to a int64: value out of range; ETA: N/A
Copy: 0/3307 0.0%; Applied: 0; Backlog: 0/1000; Time: 11s(total), 5s(copy); streamer: mysql-bin-changelog.000005:882084; Lag: 0.08s, State: throttled, Threads_running sql: Scan error on column index 1, name "Value": converting driver.Value type []uint8 ("18446744073709547932") to a int64: value out of range; ETA: N/A
Copy: 0/3307 0.0%; Applied: 0; Backlog: 0/1000; Time: 12s(total), 6s(copy); streamer: mysql-bin-changelog.000005:887849; Lag: 0.09s, State: throttled, Threads_running sql: Scan error on column index 1, name "Value": converting driver.Value type []uint8 ("18446744073709547933") to a int64: value out of range; ETA: N/A
Copy: 0/3307 0.0%; Applied: 0; Backlog: 0/1000; Time: 13s(total), 7s(copy); streamer: mysql-bin-changelog.000005:893619; Lag: 0.09s, State: throttled, Threads_running sql: Scan error on column index 1, name "Value": converting driver.Value type []uint8 ("18446744073709547934") to a int64: value out of range; ETA: N/A
Copy: 0/3307 0.0%; Applied: 0; Backlog: 0/1000; Time: 14s(total), 8s(copy); streamer: mysql-bin-changelog.000005:899391; Lag: 0.08s, State: throttled, Threads_running sql: Scan error on column index 1, name "Value": converting driver.Value type []uint8 ("18446744073709547933") to a int64: value out of range; ETA: N/A
Copy: 0/3307 0.0%; Applied: 0; Backlog: 0/1000; Time: 15s(total), 9s(copy); streamer: mysql-bin-changelog.000005:905162; Lag: 0.09s, State: throttled, Threads_running sql: Scan error on column index 1, name "Value": converting driver.Value type []uint8 ("18446744073709547934") to a int64: value out of range; ETA: N/A
Copy: 0/3307 0.0%; Applied: 0; Backlog: 0/1000; Time: 16s(total), 10s(copy); streamer: mysql-bin-changelog.000005:910934; Lag: 0.08s, State: throttled, Threads_running sql: Scan error on column index 1, name "Value": converting driver.Value type []uint8 ("18446744073709547933") to a int64: value out of range; ETA: N/A
Copy: 0/3307 0.0%; Applied: 0; Backlog: 0/1000; Time: 17s(total), 11s(copy); streamer: mysql-bin-changelog.000005:915568; Lag: 0.09s, State: throttled, Threads_running sql: Scan error on column index 1, name "Value": converting driver.Value type []uint8 ("18446744073709547933") to a int64: value out of range; ETA: N/A
Copy: 0/3307 0.0%; Applied: 0; Backlog: 0/1000; Time: 18s(total), 12s(copy); streamer: mysql-bin-changelog.000005:920205; Lag: 0.09s, State: throttled, Threads_running sql: Scan error on column index 1, name "Value": converting driver.Value type []uint8 ("18446744073709547933") to a int64: value out of range; ETA: N/A
Copy: 0/3307 0.0%; Applied: 0; Backlog: 0/1000; Time: 19s(total), 13s(copy); streamer: mysql-bin-changelog.000005:925977; Lag: 0.10s, State: throttled, Threads_running sql: Scan error on column index 1, name "Value": converting driver.Value type []uint8 ("18446744073709547934") to a int64: value out of range; ETA: N/A
Copy: 0/3307 0.0%; Applied: 0; Backlog: 0/1000; Time: 20s(total), 14s(copy); streamer: mysql-bin-changelog.000005:931749; Lag: 0.08s, State: throttled, Threads_running sql: Scan error on column index 1, name "Value": converting driver.Value type []uint8 ("18446744073709547933") to a int64: value out of range; ETA: N/A
Copy: 0/3307 0.0%; Applied: 0; Backlog: 0/1000; Time: 21s(total), 15s(copy); streamer: mysql-bin-changelog.000005:937523; Lag: 0.08s, State: throttled, Threads_running sql: Scan error on column index 1, name "Value": converting driver.Value type []uint8 ("18446744073709547932") to a int64: value out of range; ETA: N/A
Copy: 0/3307 0.0%; Applied: 0; Backlog: 0/1000; Time: 22s(total), 16s(copy); streamer: mysql-bin-changelog.000005:943297; Lag: 0.09s, State: throttled, Threads_running sql: Scan error on column index 1, name "Value": converting driver.Value type []uint8 ("18446744073709547934") to a int64: value out of range; ETA: N/A
Copy: 0/3307 0.0%; Applied: 0; Backlog: 0/1000; Time: 23s(total), 17s(copy); streamer: mysql-bin-changelog.000005:949069; Lag: 0.09s, State: throttled, Threads_running sql: Scan error on column index 1, name "Value": converting driver.Value type []uint8 ("18446744073709547933") to a int64: value out of range; ETA: N/A
Copy: 0/3307 0.0%; Applied: 0; Backlog: 0/1000; Time: 24s(total), 18s(copy); streamer: mysql-bin-changelog.000005:953704; Lag: 0.09s, State: throttled, Threads_running sql: Scan error on column index 1, name "Value": converting driver.Value type []uint8 ("18446744073709547933") to a int64: value out of range; ETA: N/A
Copy: 0/3307 0.0%; Applied: 0; Backlog: 0/1000; Time: 25s(total), 19s(copy); streamer: mysql-bin-changelog.000005:958338; Lag: 0.09s, State: throttled, Threads_running sql: Scan error on column index 1, name "Value": converting driver.Value type []uint8 ("18446744073709547933") to a int64: value out of range; ETA: N/A
Copy: 0/3307 0.0%; Applied: 0; Backlog: 0/1000; Time: 26s(total), 20s(copy); streamer: mysql-bin-changelog.000005:964111; Lag: 0.09s, State: throttled, Threads_running sql: Scan error on column index 1, name "Value": converting driver.Value type []uint8 ("18446744073709547932") to a int64: value out of range; ETA: N/A
Copy: 0/3307 0.0%; Applied: 0; Backlog: 0/1000; Time: 27s(total), 21s(copy); streamer: mysql-bin-changelog.000005:968746; Lag: 0.08s, State: throttled, Threads_running sql: Scan error on column index 1, name "Value": converting driver.Value type []uint8 ("18446744073709547932") to a int64: value out of range; ETA: N/A
^C
@jfudally
Copy link

jfudally commented Mar 25, 2020

@grantholly Could you please provide the output for this command:

show status where variable_name = 'threads_running;

At first glance it looks like the internal throttling mechanism is fetching the threads_running metric, but that value seems to be returning 18446744073709547932 for some reason causing the throttler to remain stuck.

@jfudally jfudally self-assigned this Mar 25, 2020
@grantholly
Copy link
Author

Sorry this has taken me so long to get back to. I am finally able to get back to this. Here is the output from your suggested query

mysql> show status where variable_name = 'threads_running';
+-----------------+----------------------+
| Variable_name   | Value                |
+-----------------+----------------------+
| Threads_running | 18446744073709551593 |
+-----------------+----------------------+
1 row in set (0.07 sec)

yeah, that's a lot of threads.

@jfudally
Copy link

@grantholly So what's happening is gh-ost is querying the threads_running metric, but 18446744073709551593 is larger than the max value for a 64bit integer, so it's hitting an error.

I'm not sure there's any value in getting gh-ost to handle this large of a number. I feel like you've hit a bug in mysql somehow causing this astronomical number to return.

@grantholly
Copy link
Author

This seems to be an issue with our Aurora database. Even without the gh-ost process running it still reports this comically incorrect value for threads running.

We do not see this issue on any of our other mysql databases, including those running with in our Vitess clusters.

¯_(ツ)_/¯

@jfudally
Copy link

Closing this as the issue is with Aurora. Good luck!

@grantholly
Copy link
Author

For posterity, AWS claims this is fixed in Aurora versions >= 2.04.7. From the documentation:

Fixed an issue that resulted in an incorrect value for the Threads_running global status variable.

https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Updates.2047.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants