Work-in-progress for triton-mysql #1
Conversation
Instances bootstrap themselves via a Containerbuddy `onStart` handler. Replication is not yet working; going to switch to GTID-based logging which is better-equipped for setup without manual intervention.
By using the GTID replication available in MySQL 5.7, we can avoid having the replica `onStart` manually probe inside the primary in order to find out the binlog position. Instead the DB will use GTIDs to auto-configure this position. This will work for cases where we're first starting up a cluster. For starting up instances on an existing cluster with significant data, we'll want to add the ability to migrate a dump of the existing data first.
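For reference, a minimal sketch of the MySQL 5.7 settings that GTID auto-positioning relies on, written as a config drop-in (the file path is just an example, not necessarily where this repo puts it):

```bash
# Example only: the GTID-related my.cnf settings the approach above depends on.
cat > /etc/mysql/conf.d/gtid.cnf <<'EOF'
[mysqld]
server-id                = 1      # must be unique per instance
log-bin                  = mysql-bin
gtid-mode                = ON
enforce-gtid-consistency = ON
log-slave-updates        = ON
EOF
# On a replica, auto-positioning is then just:
#   CHANGE MASTER TO MASTER_HOST='<primary>', MASTER_AUTO_POSITION=1;
```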
links:
  - consul:consul

mysql_replica:
This is giving us both some pain. Should we attempt to do leader election `onStart` to eliminate this? I've been avoiding that because of the risk of race conditions. Should we take it on now, though?
Rather than doing leader "election", we can rely on the fact that `docker-compose` starts one node first. The decision tree looks like this (a rough shell sketch follows the list):

- during `onStart` the nodes will ask Consul if there is a primary
  - no? the node will write a key to Consul marking itself as primary. was DB initialized?
    - no? init the DB and start mysqld
    - yes? stop acting as a replica and start mysqld
  - yes? ok, is the primary healthy?
    - no? halt and catch fire
    - yes? set that primary as its source and start mysqld
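Here's a rough shell sketch of that decision tree against Consul's HTTP KV API; the key name, service name, and file paths are placeholders rather than what this repo actually uses:

```bash
#!/bin/bash
# Hypothetical onStart sketch; key/service names and paths are placeholders.
CONSUL=${CONSUL:-consul:8500}
PRIMARY_KEY=mysql-primary

# ask Consul if there is a primary
primary=$(curl -fs "http://${CONSUL}/v1/kv/${PRIMARY_KEY}?raw")

if [ -z "${primary}" ]; then
    # no primary: write a key to Consul marking ourselves as primary.
    # cas=0 means "only write if the key does not already exist"; a real
    # implementation should check the true/false response to guard against races.
    curl -fs -X PUT -d "$(hostname)" "http://${CONSUL}/v1/kv/${PRIMARY_KEY}?cas=0"
    if [ ! -d /var/lib/mysql/mysql ]; then
        mysqld --initialize-insecure            # DB not initialized yet
    fi
    echo primary > /etc/mysql-role              # entrypoint starts mysqld as the primary
else
    # a primary exists: is it healthy?
    passing=$(curl -fs "http://${CONSUL}/v1/health/service/mysql-primary?passing")
    if [ -z "${passing}" ] || [ "${passing}" = "[]" ]; then
        echo "primary ${primary} is not healthy" >&2
        exit 1                                  # halt and catch fire
    fi
    echo "${primary}" > /etc/mysql-replicate-from   # entrypoint sets this as the source
fi
```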
If we need to promote a replica, we can clear the key in Consul and just restart a replica, then `docker exec` a command to the replicas to `CHANGE MASTER` to the new primary.
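Done by hand, that might look something like this (container names, the Consul key, and the replication credentials are all placeholders):

```bash
# Manual promotion sketch -- names and credentials are illustrative only.
# 1. clear the primary key in Consul so the restarted replica can claim it
curl -fs -X DELETE "http://consul:8500/v1/kv/mysql-primary"

# 2. restart the replica we want to promote; its onStart finds no primary and claims the key
docker restart mysql_mysql_2

# 3. point the remaining replicas at the new primary
docker exec mysql_mysql_3 mysql -uroot -e \
    "STOP SLAVE;
     CHANGE MASTER TO MASTER_HOST='mysql_mysql_2',
       MASTER_USER='repl', MASTER_PASSWORD='repl-pass',
       MASTER_AUTO_POSITION=1;
     START SLAVE;"
```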
restart a replica, then `docker exec` a command to the replicas to `CHANGE MASTER` to the new primary
Would Containerbuddy be able to detect the changed master? Not saying we should, but we could automatically trigger the `CHANGE MASTER` in that context, yes?
Set the number of concurrent threads. CPUs*2 is recommended. I'd bet we could set it to
Slightly larger thread and query caches, though the query cache size obviously is very specific to the application.
I found a bunch of performance improvement with increased temp/memory table size (which helped keep joins from going to disk). Good old http://www.mysqlperformanceblog.com/2007/01/19/tmp_table_size-and-max_heap_table_size/ explains that they're independent and yet dependent.
This is specific to MyISAM, which I wouldn't recommend, but it is still needed for some purposes (specifically, full text indexes, which aren't supported in InnoDB, last I knew): allow concurrent inserts on all tables, including "dirty" ones, via a tip at http://dev.mysql.com/doc/refman/5.0/en/server-system-variables.html#sysvar_concurrent_insert
This may now be the default, but using one file for each innodb table is very important. Docs are at http://dev.mysql.com/doc/refman/5.5/en/innodb-parameters.html#sysvar_innodb_file_per_table
This one comes up a lot because we lie to the OS about how many cores it has. I feel like we need a standardized way to pick it from the environment. Is that even possible?
Modern MySQL (5.6.8+) can auto-configure this by default rather than defaulting it to 0.
Ok (default is 1M). We'll also need to set
I'm super-wary about relying on a post that old as gospel for anything from 5.6+, at least as far as details go. But I'll dig into it.
Current default
Supported in 5.6+ (ref https://dev.mysql.com/doc/refman/5.7/en/innodb-fulltext-index.html). The replication setup we're using here won't support MyISAM, for what it's worth.
It is now the default (ref http://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_file_per_table)
Rather than doing leader "election", we mark which node is primary in Consul and use that key with CAS to identify whether to init as a primary or set up replication.
Pushed a commit (1fdeef3) that handles the items I think we need to handle from that list.
Maybe. There's some metadata we can read that might have some useful info. If not, it's probably not a huge project to add something useful into that metadata. Noted.
At one time, that was probably the single most important performance change a person could configure for WP on MySQL. It's good that it's not
Noted. * Strokes beard, thinks evil thoughts *
Eh. Ok.
After sleeping on it, I'm realizing that the defaults are probably just fine for that and my use case was not common enough to deserve adjusting those vars here.
query_cache_size = 32M
query_cache_type = ON
tmp_table_size = 128M
max_heap_table_size = 128M
If we keep `tmp_table_size` and `max_heap_table_size`, they should probably go in the commented-out suggestions above. Sorry I made a point of those.
# just via docker-compose

# dataDump() {
#     mysqldump -h ${PRIMARY_HOST} -P 3306 --all-databases --master-data --set-gtid-purged=ON > dbdump.db
This may just be me, but my preferred way to do this is to stop MySQL and grab the filesystem. My reasoning for that includes:
- The only safe way to get a dump is to lock the tables, which is effectively the same as stopping MySQL.
- Dumps and imports take a very long time compared to filesystem copies.
- It's a recommended and fully supported solution that doesn't include the pain of the above.
Perhaps you've had experience that has you preferring alternatives, though?
Yeah, I was beginning to think that as well.
But here's a radical idea -- what about taking a "container-native" approach? Stop the primary and then `docker commit` the whole container and bring it up as a new replica. Maybe not possible to do properly with `docker-compose`, but it seems like a really powerful approach.
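Roughly, by hand, the idea might look like this (a sketch only; container and image names are made up, and as noted below anything on a declared volume would not be captured):

```bash
# Sketch of the "container-native" snapshot idea -- names are illustrative.
docker stop mysql_primary                        # quiesce the data directory
docker commit mysql_primary mysql-snapshot:1     # capture the stopped container as an image
docker start mysql_primary                       # bring the primary back up
docker run -d --name mysql_replica_2 mysql-snapshot:1   # new replica from the snapshot
```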
Stop the primary and then docker commit the whole container
I'd been toying with that myself. The biggest challenge was not having support for it, but that's changing. The next problem is that anything on a volume doesn't get committed, and putting data on volumes is increasingly common Docker practice.
On Triton we wouldn't recommend having volumes though, right?
The biggest challenge was not having support for it, but that's changing.
Ah, I'd missed that. Too bad.
Take a look at the README now and see what you think. I've glossed over the details of "how do we move the files" for the moment but I think the overall process will work.
On Triton we wouldn't recommend having volumes though
Well, we wouldn't have a separate data container as a volume, and there's no specific performance advantage to declaring `VOLUME /var/lib/mysql` in our Dockerfile, but there's still a huge convenience advantage to doing so if we ever mount the MySQL container as a volume in another container, so the two interests are contradictory.
Turns out this setting only ever did anything on Solaris 8, and Oracle deprecated it in 5.6 and removed it in 5.7. (ref https://blogs.oracle.com/supportingmysql/entry/remove_on_sight_thread_concurrency)
That's embarrassing. That means I have to figure out a new explanation for the problems, and their solution, ten years ago, after which I'd been setting that value very carefully. Separately,
A primary that has rotated the binlog or simply has a large binlog will be impractical to use to bootstrap replication without copying data first. In this case we're going to [copy the MySQL data directory](https://dev.mysql.com/doc/refman/5.7/en/replication-gtids-failover.html) to the new replica's file system.

To safely snapshot MySQL, we need to prevent new writes. To avoid downtime for the application, we recommend using one of the other replicas as a source for the data directory. The process that's been automated here is as follows:
Re #1 (comment):
Take a look at the README now and see what you think
I think you're on to something, so let me raise you one:
Rather than dual-purposing one of the replicas (which has some arguable advantages), perhaps we should consider having a third mode for our MySQL image: a backup instance. The backup instance would be a replica, but would never announce itself as such; instead, it would snapshot itself hourly (or at some other interval), and possibly keep `n` previous snapshots.
Whether this approach includes a separate service in the Compose yaml or the instance can auto-detect and elect to become a backup host is uncertain.
Actually, if the backup service is auto-elected from among the many MySQL instances, rather than being a separate Compose service, maybe it's OK to have it also be a read replica as well (though it would appear and disappear from service regularly, which would be annoying).
Write the binlog filename to Consul whenever there's a snapshot. On health check, if the binlog file name changes from what's in Consul then it's been rotated and we do a new snapshot. This commit also combines the behaviors of the primary and standby into a single behavior, but provides an optional override via USE_STANDBY env var.
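A rough sketch of that health-check step; the Consul key name and the snapshot helper are placeholders, not the names used in this repo:

```bash
# Hypothetical health-handler fragment -- key name and snapshot helper are placeholders.
current_binlog=$(mysql -uroot --batch --skip-column-names \
    -e "SHOW MASTER STATUS" | awk '{print $1}')
last_snapshot=$(curl -fs "http://consul:8500/v1/kv/mysql-last-binlog?raw")

if [ "${current_binlog}" != "${last_snapshot}" ]; then
    run_snapshot_to_manta                       # placeholder for the actual snapshot step
    curl -fs -X PUT -d "${current_binlog}" \
        "http://consul:8500/v1/kv/mysql-last-binlog"
fi
```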
### Failover via `onChange` handler

*(Note: this is all TODO and the lock semantics won't quite work like this with the code the way it is now)*
This is going to be a little tricky. There's a health heartbeat for `mysql-primary` already, but that API doesn't have a locking semantic. I think we'll need to use Consul's sessions API to make this work -- those can have both locks and TTLs.
Nevermind, the KV API does have a locking option. Should be easy enough to make work.
(First pass-thru of this.) The key we use to mark the primary in Consul is now locked by a session with TTL. The primary updates this TTL with each pass thru the `health` handler. If the primary becomes unhealthy, the replica(s) will try to get this lock in their `onChange` handler. The winner will become the new primary and the old replicas will update their replication config to point to it.
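A minimal sketch of that session-with-TTL lock against Consul's HTTP API (the key name, session name, and TTL are illustrative, not the repo's actual values):

```bash
# Hypothetical lock handling -- key name, session name, and TTL are illustrative.
CONSUL=http://consul:8500
KEY=mysql-primary

# create a session with a TTL and try to acquire the lock on the primary key
acquire_primary_lock() {
    SESSION_ID=$(curl -fs -X PUT -d '{"Name": "mysql-primary-lock", "TTL": "60s"}' \
                 "${CONSUL}/v1/session/create" | jq -r .ID)
    curl -fs -X PUT -d "$(hostname)" \
         "${CONSUL}/v1/kv/${KEY}?acquire=${SESSION_ID}"    # prints "true" if we got it
}

# called from the primary's health handler so the TTL never lapses
renew_primary_lock() {
    curl -fs -X PUT "${CONSUL}/v1/session/renew/${SESSION_ID}" > /dev/null
}

# called from a replica's onChange handler when the primary looks unhealthy;
# once the old primary's session expires, exactly one replica will win the lock
try_to_take_over() {
    if [ "$(acquire_primary_lock)" = "true" ]; then
        echo "this node is the new primary"
    fi
}
```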
- `MANTA_SUBUSER`: the Manta subuser account name, if any.
- `MANTA_ROLE`: the Manta role name, if any.
- `MANTA_KEY_ID`: the MD5-format ssh key id for the Manta account/subuser (ex. `1a:b8:30:2e:57:ce:59:1d:16:f6:19:97:f2:60:2b:3d`).
- `MANTA_PRIVATE_KEY`: the private ssh key for the Manta account/subuser.
The actual key, or the path to the key?
Actual key, alas.
Something like
export MANTA_PRIVATE_KEY=`cat ~/.ssh/id_rsa`
Here's my `_env` file:
# Environment variables for MySQL service
MYSQL_USER=someuser
MYSQL_PASSWORD=someuser
MYSQL_DATABASE=someuser
MYSQL_REPL_USER=anotheruser
MYSQL_REPL_PASSWORD=anotheruser
# Environment variables for backups to Manta
MANTA_BUCKET=/myuser/stor/triton-mysql
MANTA_USER=myuser
MANTA_KEY_ID=SHA256:some_sha
MANTA_URL=https://us-east.manta.joyent.com
It doesn't have a `MANTA_PRIVATE_KEY`. Instead, I'm setting that in my shell environment with:
export MANTA_PRIVATE_KEY=`cat ~/.ssh/id_rsa`
With that set, and Docker properly configured (with a current version), it's just
docker-compose up -d
And then…
docker-compose scale mysql=3
to scale it up
@tgross we probably need to explain `MANTA_PRIVATE_KEY` in the README and blog post.
When the MySQLNode instance is instantiated, it doesn't know if the node is primary or not. In the on_change handler we check to see if the node is primary, but we did not first update the node from Consul, so the check always failed and the primary would execute the on_change behaviors we expect from the replicas. This would be harmless, but it races with health handler behaviors during the initial snapshot.
Still some work to be done but merging to master ahead of ContainerSummit.
Work-in-progress for triton-mysql
Instances bootstrap themselves via a Containerbuddy `onStart` handler. By using the GTID replication available in MySQL 5.7, we can avoid having the replica `onStart` manually probe inside the primary in order to find out the binlog position. Instead the DB will use GTIDs to auto-configure this position.
This will work for cases where we're first starting up a cluster -- a replica will go thru as much of the binlog as it needs to catch up on startup. But for starting up instances on an existing cluster with significant data, we'll want to add the ability to migrate a dump of the existing data first.
TODO:
cc @misterbisson @xer0x for comment