Work-in-progress for triton-mysql #1

Merged: 27 commits into master from the working branch on Feb 10, 2016
Conversation

@tgross commented Dec 9, 2015

Instances bootstrap themselves via a Containerbuddy onStart handler. By using the GTID replication available in MySQL 5.7, we can avoid having the replica's onStart handler manually probe the primary to find the binlog position. Instead, the DB uses GTIDs to auto-configure that position.

This will work for cases where we're first starting up a cluster -- a replica will replay as much of the binlog as it needs to catch up on startup. But for starting up instances on an existing cluster with significant data, we'll want to add the ability to migrate a dump of the existing data first.

TODO:

  • migrating data dumps for new replicas on existing clusters
  • a simple test script to use a mysql client to verify replication is working

cc @misterbisson @xer0x for comment
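
For reference, a minimal sketch of what GTID-based auto-positioning looks like in MySQL 5.7. The option names and SQL are standard MySQL 5.7; the hostname and replication credentials are placeholders, not this repo's actual values.

# my.cnf on every node (server_id must be unique per instance)
gtid_mode = ON
enforce_gtid_consistency = ON
log_bin = mysql-bin
log_slave_updates = ON
server_id = 1

-- on a replica: point at the primary without specifying a binlog file/position
CHANGE MASTER TO MASTER_HOST='mysql-primary', MASTER_USER='repl',
  MASTER_PASSWORD='...', MASTER_AUTO_POSITION=1;
START SLAVE;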

Instances bootstrap themselves via a Containerbuddy `onStart` handler.
Replication is not yet working; going to switch to GTID-based logging
which is better-equipped for setup without manual intervention.
By using the GTID replication available in MySQL 5.7, we can avoid
having the replica `onStart` manually probing inside the primary in
order to find out the binlog position. Instead the DB will use GTIDs
to auto-configure this position.

This will work for cases where we're first starting up a cluster.
For starting up instances on an existing cluster with significant
data, we'll want to add the ability to migrate a dump of the existing
data first.
links:
- consul:consul

mysql_replica:

Contributor:

This is giving us both some pain. Should we attempt to do leader election onStart to eliminate this? I've been avoiding that because of the risk of race conditions. Should we take it on now, though?

Contributor Author:

Rather than doing leader "election", we can rely on the fact that docker-compose starts one node first. The decision tree looks like this:

  • during onStart the nodes will ask Consul if there is a primary
    • no? the node will write a key to Consul marking itself as primary. was DB initialized?
      • no? init the DB and start mysqld
      • yes? stop acting as a replica and start mysqld
    • yes? ok, is the primary healthy?
      • no? halt and catch fire
      • yes? set that primary as its source and start mysqld

If we need to promote a replica, we can clear the key in Consul and just restart a replica, then docker exec a command to the replicas to CHANGE MASTER to the new primary.
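
A rough shell sketch of that onStart decision tree, assuming Consul is reachable at consul:8500 and using an illustrative key name (the health check and the replication-setup step are elided):

# onStart: decide whether this node is the primary or a replica
PRIMARY=$(curl -fs http://consul:8500/v1/kv/mysql-primary?raw)
if [ -z "$PRIMARY" ]; then
    # no primary yet: mark ourselves as primary, init the DB if it's empty
    curl -fs -X PUT -d "$(hostname)" http://consul:8500/v1/kv/mysql-primary
    [ -d /var/lib/mysql/mysql ] || mysqld --initialize-insecure
else
    # a primary exists: check its health, then configure replication against it
    echo "will replicate from ${PRIMARY}"
fi
# mysqld itself is started by Containerbuddy once onStart succeeds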

Contributor:

restart a replica, then docker exec a command to the replicas to CHANGE MASTER to the new primary

Would Containerbuddy be able to detect the changed master? Not saying we should, but we could automatically trigger the CHANGE MASTER in that context, yes?
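
Containerbuddy does fire an onChange handler when a watched backend's membership changes in Consul, so in principle that handler could re-point replication itself. A hedged sketch of what such a handler might run (host, user, and key name are placeholders, not the actual handler in this repo):

# onChange: re-point replication at whatever Consul now says is primary
NEW_PRIMARY=$(curl -fs http://consul:8500/v1/kv/mysql-primary?raw)
mysql -u root -e "STOP SLAVE;
  CHANGE MASTER TO MASTER_HOST='${NEW_PRIMARY}', MASTER_AUTO_POSITION=1;
  START SLAVE;"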

@misterbisson commented:

Some my.cnf settings we might also consider adding:

Set the number of concurrent threads. CPUs*2 is recommended. I'd bet we could set it to $RAM_in_GBs with a min of 2 and max of 96 and an option to override it.

thread_concurrency = 6

Slightly larger thread and query caches, though the query cache size obviously is very specific to the application.

thread_cache_size = 25
query_cache_size = 32M

I found a bunch of performance improvements with increased temp/memory table sizes (which helped keep joins from going to disk). Good old http://www.mysqlperformanceblog.com/2007/01/19/tmp_table_size-and-max_heap_table_size/ explains that they're independent and yet dependent.

tmp_table_size = 128M
max_heap_table_size = 128M 

This is specific to MyISAM, which I wouldn't recommend, but it's still needed for some purposes (specifically, full-text indexes, which aren't supported in InnoDB, last I knew): allow concurrent inserts on all tables, including "dirty" ones, via a tip at http://dev.mysql.com/doc/refman/5.0/en/server-system-variables.html#sysvar_concurrent_insert

concurrent_insert = 2

This may now be the default, but using one file for each innodb table is very important. Docs are at http://dev.mysql.com/doc/refman/5.5/en/innodb-parameters.html#sysvar_innodb_file_per_table

innodb_file_per_table = 1

@tgross commented Dec 10, 2015

CPUs*2 is recommended.

This one comes up a lot because we lie to the OS about how many cores it has. I feel like we need a standardized way to pick it from the environment. Is that even possible?

thread_cache_size = 25

Modern MySQL (5.6.8+) auto-sizes this by default rather than defaulting it to 0.

query_cache_size = 32M

Ok (default is 1M). We'll also need to set query_cache_type to ON.

http://www.mysqlperformanceblog.com/2007/01/19/tmp_table_size-and-max_heap_table_size/

I'm super-wary about relying on a post that old as gospel for anything from 5.6+, at least as far as the details go. But I'll dig into it.

tmp_table_size = 128M
max_heap_table_size = 128M

The current default for both tmp_table_size and max_heap_table_size is 16777216, which is 16MB (the docs give no units). We can bump these to 128M, I guess.

This is specific to MyISAM, which I wouldn't recommend, but are still needed for some purposes (specifically, full text indexes, which aren't supported in InnoDB, last I knew)

Supported in 5.6+ (ref https://dev.mysql.com/doc/refman/5.7/en/innodb-fulltext-index.html). The replication setup we're using here won't support MyISAM, for what it's worth.

This may now be the default, but using one file for each innodb table is very important

It is now the default (ref http://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_file_per_table)

Rather than doing leader "election", we mark which node is primary in Consul
and use that key with CAS to identify whether to init as a primary or set up
replication.
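
For reference, the CAS behavior with Consul's KV HTTP API: a PUT with ?cas=0 only succeeds if the key does not already exist, so exactly one node can mark itself primary. The key name here is illustrative.

# returns "true" for exactly one node; the others get "false" and configure replication instead
curl -s -X PUT -d "$(hostname)" "http://consul:8500/v1/kv/mysql-primary?cas=0"
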
@tgross commented Dec 10, 2015

Pushed a commit (1fdeef3) that handles the items I think we need to handle from that list.

@misterbisson commented:

I feel like we need a standardized way to pick it from the environment. Is that even possible?

Maybe. There's some metadata we can read that might have some useful info. If not, it's probably not a huge project to add some useful info to that metadata. Noted.

thread_cache_size ... Modern mysql (5.6.8+) can auto-configure by default

At one time, that was probably the single most important performance setting a person could configure for WP on MySQL. It's good that it's not 0 by default anymore, though I also have to admit that mysqli might no longer benefit from this.

Modern mysql...

Noted. * Strokes beard, thinks evil thoughts *

The replication setup we're using here won't support MyISAM, for what it's worth

Eh. Ok.

tmp_table_size and max_heap_table_size

After sleeping on it, I'm realizing that the defaults are probably just fine for that and my use case was not common enough to deserve adjusting those vars here.

query_cache_size = 32M
query_cache_type = ON
tmp_table_size = 128M
max_heap_table_size = 128M

Contributor:

If we keep tmp_table_size and max_heap_table_size, they should probably go in the commented-out suggestions above. Sorry I made a point of those.

# just via docker-compose

# dataDump() {
# mysqldump -h ${PRIMARY_HOST} -P 3306 --all-databases --master-data --set-gtid-purged=ON > dbdump.db

Contributor:

This may just be me, but my preferred way to do this is to stop MySQL and grab the filesystem. My reasoning for that includes:

  1. The only safe way to get a dump is to lock the tables, which is effectively the same as stopping MySQL.
  2. Dumps and imports take a very long time compared to filesystem copies.
  3. It's a recommended and fully supported solution that doesn't include the pain of the above.

Perhaps you've had experience that has you preferring alternatives, though?
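
A minimal sketch of the stop-and-copy approach (default datadir path assumed, credentials elided, and how the tarball moves between instances is left out):

# on the source instance: stop mysqld so the data files are consistent, then archive them
mysqladmin -u root shutdown
tar -czf /tmp/mysql-datadir.tar.gz -C /var/lib/mysql .

# on the new replica, before mysqld ever starts:
tar -xzf /tmp/mysql-datadir.tar.gz -C /var/lib/mysql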

Contributor Author:

Yeah, I was beginning to think that as well.

But here's a radical idea -- what about taking a "container-native" approach? Stop the primary and then docker commit the whole container and bring it up as a new replica. Maybe not possible to do properly with docker-compose but it seems like a really powerful approach.
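
A sketch of that flow with plain docker commands (container and image names are made up; as noted below, data that lives on a volume would not be captured by the commit):

docker stop mysql_primary
docker commit mysql_primary local/mysql-seed:latest
docker start mysql_primary
# bring the committed image up as a new replica
docker run -d --name mysql_replica local/mysql-seed:latest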

Contributor:

Stop the primary and then docker commit the whole container

I'd been toying with that myself. The biggest challenge was not having support for it, but that's changing. The next problem is that anything on a volume doesn't get committed, which is changing more Docker practice.

Contributor Author:

On Triton we wouldn't recommend having volumes though, right?

The biggest challenge was not having support for it, but that's changing.

Ah, I'd missed that. Too bad.

Contributor Author:

Take a look at the README now and see what you think. I've glossed over the details of "how do we move the files" for the moment but I think the overall process will work.

Contributor:

On Triton we wouldn't recommend having volumes though

Well, we wouldn't have a separate data container as a volume, and there's no specific performance advantage to declaring VOLUME /var/lib/mysql in our Dockerfile. But there's still a huge convenience advantage to doing so if we ever mount the MySQL container as a volume in another container, so the two interests are contradictory.

@tgross commented Dec 11, 2015

thread_concurrency = 6

Turns out this setting only ever did anything on Solaris 8, and Oracle deprecated it in 5.6 and removed it in 5.7. (ref https://blogs.oracle.com/supportingmysql/entry/remove_on_sight_thread_concurrency)

@misterbisson commented:

Turns out this setting only ever did anything on Solaris 8

That's embarrassing. That means I have to figure out a new explanation for the problems I was solving ten years ago, after which I'd been setting that value very carefully.

Separately, innodb_thread_concurrency is a thing, along with some others. Here, have a 2006 blog post explaining it. I know you love those.


A primary that has rotated the binlog or simply has a large binlog will be impractical to use to bootstrap replication without copying data first. In this case we're going to [copy the MySQL data directory](https://dev.mysql.com/doc/refman/5.7/en/replication-gtids-failover.html) to the new replica's file system.

In order to safely snapshot MySQL, we need to prevent new writes. In order to avoid downtime for the application, we recommend using one of the other replicas as a source for the data directory. The process that's been automated here is as follows:

Contributor:

Re #1 (comment):

Take a look at the README now and see what you think

I think you're on to something, so let me raise you one:

Rather than dual-purposing one of the replicas (which has some arguable advantages), perhaps we should consider having a third mode for our MySQL image: a backup instance. The backup instance would be a replica but would never announce itself as such; instead, it would snapshot itself hourly (or at some other interval), and possibly keep n previous snapshots.

Whether this approach includes a separate service in the Compose yaml or whether the instance can auto-detect and elect to become a backup host is uncertain.

Actually, if the backup service is auto-elected from among the many MySQL instances, rather than being a separate Compose service, maybe it's OK to have it also be a read replica (though it would appear and disappear from service regularly, which would be annoying).

Write the binlog filename to Consul whenever there's a snapshot. On health
check, if the binlog file name changes from what's in Consul then it's been
rotated and we do a new snapshot.

This commit also combines the behaviors of the primary and standby into a
single behavior, but provides an optional override via USE_STANDBY env var.
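
A rough sketch of that health-check comparison (the Consul key name and the snapshot step are placeholders):

# health handler: snapshot again only if the binlog file has rotated since the last snapshot
CURRENT=$(mysql -u root -e 'SHOW MASTER STATUS\G' | awk '/File:/ {print $2}')
LAST=$(curl -fs http://consul:8500/v1/kv/mysql-last-binlog?raw)
if [ "$CURRENT" != "$LAST" ]; then
    take_snapshot   # placeholder for the actual snapshot/upload step
    curl -fs -X PUT -d "$CURRENT" http://consul:8500/v1/kv/mysql-last-binlog
fi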

### Failover via `onChange` handler

*(Note: this is all TODO and the lock semantic won't quite work like this with the code as it is now)*

Contributor Author:

This is going to be a little tricky. There's a health heartbeat for mysql-primary already, but that API doesn't have a locking semantic. I think we'll need to use Consul's sessions API to make this work -- those can have both locks and TTLs. Nevermind, the KV API does have a locking option. Should be easy enough to make work.
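
For reference, the KV lock is built on a session: create a session with a TTL, acquire the key with it, and renew the session on each health pass so the lock doesn't expire (key name and TTL below are illustrative):

# create a session with a TTL
SESSION=$(curl -s -X PUT -d '{"TTL": "30s"}' http://consul:8500/v1/session/create | jq -r .ID)
# acquire the primary key with it; returns "true" only for the winner
curl -s -X PUT -d "$(hostname)" "http://consul:8500/v1/kv/mysql-primary?acquire=${SESSION}"
# renew the session from the health handler so the TTL doesn't lapse
curl -s -X PUT "http://consul:8500/v1/session/renew/${SESSION}"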

(First pass-thru of this.)

The key we use to mark the primary in Consul is now locked by a session
with TTL. The primary updates this TTL with each pass thru the `health`
handler. If the primary becomes unhealthy, the replica(s) will try to
get this lock in their `onChange` handler. The winner will become the
new primary and the old replicas will update their replication config
to point to it.
- `MANTA_SUBUSER`: the Manta subuser account name, if any.
- `MANTA_ROLE`: the Manta role name, if any.
- `MANTA_KEY_ID`: the MD5-format ssh key id for the Manta account/subuser (ex. `1a:b8:30:2e:57:ce:59:1d:16:f6:19:97:f2:60:2b:3d`).
- `MANTA_PRIVATE_KEY`: the private ssh key for the Manta account/subuser.

Contributor:

The actual key, or the path to the key?

Contributor Author:

Actual key, alas.

Contributor:

Something like

export MANTA_PRIVATE_KEY=`cat ~/.ssh/id_rsa`

Contributor:

Here's my _env file:

# Environment variables for MySQL service
MYSQL_USER=someuser
MYSQL_PASSWORD=someuser
MYSQL_DATABASE=someuser
MYSQL_REPL_USER=anotheruser
MYSQL_REPL_PASSWORD=anotheruser

# Environment variables for backups to Manta
MANTA_BUCKET=/myuser/stor/triton-mysql
MANTA_USER=myuser
MANTA_KEY_ID=SHA256:some_sha
MANTA_URL=https://us-east.manta.joyent.com

It doesn't have a MANTA_PRIVATE_KEY. Instead, I'm setting that in my shell environment with:

export MANTA_PRIVATE_KEY=`cat ~/.ssh/id_rsa`

With that set, and Docker properly configured (with a current version), it's just

docker-compose up -d

And then…

docker-compose scale mysql=3

to scale it up

Contributor:

@tgross we probably need to explain MANTA_PRIVATE_KEY in the README and blog post.

When the MySQLNode instance is instantiated it doesn't know if the node
is primary or not. In the on_change handler we check whether the node
is primary, but we did not first update the node from Consul, so the check
always failed and the primary would execute the on_change behaviors we
expect from the replicas. This would be harmless except that it races with
health handler behaviors during the initial snapshot.
@tgross commented Feb 10, 2016

Still some work to be done but merging to master ahead of ContainerSummit.

tgross added a commit that referenced this pull request Feb 10, 2016
Work-in-progress for triton-mysql
@tgross merged commit 524a3d4 into master Feb 10, 2016
@misterbisson deleted the working branch June 9, 2016
tgross pushed a commit that referenced this pull request Sep 15, 2016