Work-in-progress for triton-mysql #1

Merged: 27 commits into master from the working branch on Feb 10, 2016
Conversation

@tgross commented Dec 9, 2015

Instances bootstrap themselves via a Containerbuddy onStart handler. By using the GTID replication available in MySQL 5.7, we can avoid having the replica's onStart handler manually probe the primary to find the binlog position. Instead, the DB uses GTIDs to auto-configure that position.

This will work for cases where we're first starting up a cluster -- a replica will replay as much of the binlog as it needs to catch up on startup. But for starting up instances on an existing cluster with significant data, we'll want to add the ability to migrate a dump of the existing data first.

TODO:

  • migrating data dumps for new replicas on existing clusters
  • a simple test script to use a mysql client to verify replication is working

cc @misterbisson @xer0x for comment
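
For reference, a minimal sketch of what GTID-based auto-positioning looks like in MySQL 5.7. The option names and SQL are standard MySQL 5.7; the hostname and replication credentials are placeholders, not this repo's actual values.

# my.cnf on every node (server_id must be unique per instance)
gtid_mode = ON
enforce_gtid_consistency = ON
log_bin = mysql-bin
log_slave_updates = ON
server_id = 1

-- on a replica: point at the primary without specifying a binlog file/position
CHANGE MASTER TO MASTER_HOST='mysql-primary', MASTER_USER='repl',
  MASTER_PASSWORD='...', MASTER_AUTO_POSITION=1;
START SLAVE;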

Instances bootstrap themselves via a Containerbuddy `onStart` handler.
Replication is not yet working; going to switch to GTID-based logging
which is better-equipped for setup without manual intervention.
By using the GTID replication available in MySQL 5.7, we can avoid
having the replica `onStart` manually probing inside the primary in
order to find out the binlog position. Instead the DB will use GTIDs
to auto-configure this position.

This will work for cases where we're first starting up a cluster.
For starting up instances on an existing cluster with significant
data, we'll want to add the ability to migrate a dump of the existing
data first.
links:
- consul:consul

mysql_replica:

Contributor:

This is giving us both some pain. Should we attempt to do leader election onStart to eliminate this? I've been avoiding that because of the risk of race conditions. Should we take it on now, though?

Contributor Author:

Rather than doing leader "election", we can rely on the fact that docker-compose starts one node first. The decision tree looks like this:

  • during onStart the nodes will ask Consul if there is a primary
    • no? the node will write a key to Consul marking itself as primary. was DB initialized?
      • no? init the DB and start mysqld
      • yes? stop acting as a replica and start mysqld
    • yes? ok, is the primary healthy?
      • no? halt and catch fire
      • yes? set that primary as its source and start mysqld

If we need to promote a replica, we can clear the key in Consul and just restart a replica, then docker exec a command to the replicas to CHANGE MASTER to the new primary.
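
A rough shell sketch of that onStart decision tree, assuming Consul is reachable at consul:8500 and using an illustrative key name (the health check and the replication-setup step are elided):

# onStart: decide whether this node is the primary or a replica
PRIMARY=$(curl -fs http://consul:8500/v1/kv/mysql-primary?raw)
if [ -z "$PRIMARY" ]; then
    # no primary yet: mark ourselves as primary, init the DB if it's empty
    curl -fs -X PUT -d "$(hostname)" http://consul:8500/v1/kv/mysql-primary
    [ -d /var/lib/mysql/mysql ] || mysqld --initialize-insecure
else
    # a primary exists: check its health, then configure replication against it
    echo "will replicate from ${PRIMARY}"
fi
# mysqld itself is started by Containerbuddy once onStart succeeds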

Contributor:

restart a replica, then docker exec a command to the replicas to CHANGE MASTER to the new primary

Would Containerbuddy be able to detect the changed master? Not saying we should, but we could automatically trigger the CHANGE MASTER in that context, yes?
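
Containerbuddy does fire an onChange handler when a watched backend's membership changes in Consul, so in principle that handler could re-point replication itself. A hedged sketch of what such a handler might run (host, user, and key name are placeholders, not the actual handler in this repo):

# onChange: re-point replication at whatever Consul now says is primary
NEW_PRIMARY=$(curl -fs http://consul:8500/v1/kv/mysql-primary?raw)
mysql -u root -e "STOP SLAVE;
  CHANGE MASTER TO MASTER_HOST='${NEW_PRIMARY}', MASTER_AUTO_POSITION=1;
  START SLAVE;"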

@misterbisson commented:

Some my.cnf settings we might also consider adding:

Set the number of concurrent threads. CPUs*2 is recommended. I'd bet we could set it to $RAM_in_GBs with a min of 2 and max of 96 and an option to override it.

thread_concurrency = 6

Slightly larger thread and query caches, though the query cache size obviously is very specific to the application.

thread_cache_size = 25
query_cache_size = 32M

I found a bunch of performance improvements with increased temp/memory table sizes (which helped keep joins from going to disk). Good old http://www.mysqlperformanceblog.com/2007/01/19/tmp_table_size-and-max_heap_table_size/ explains that they're independent and yet dependent.

tmp_table_size = 128M
max_heap_table_size = 128M 

This is specific to MyISAM, which I wouldn't recommend, but it's still needed for some purposes (specifically, full-text indexes, which aren't supported in InnoDB, last I knew): allow concurrent inserts on all tables, including "dirty" ones, via a tip at http://dev.mysql.com/doc/refman/5.0/en/server-system-variables.html#sysvar_concurrent_insert

concurrent_insert = 2

This may now be the default, but using one file for each innodb table is very important. Docs are at http://dev.mysql.com/doc/refman/5.5/en/innodb-parameters.html#sysvar_innodb_file_per_table

innodb_file_per_table = 1

@tgross commented Dec 10, 2015

CPUs*2 is recommended.

This one comes up a lot because we lie to the OS about how many cores it has. I feel like we need a standardized way to pick it from the environment. Is that even possible?

thread_cache_size = 25

Modern MySQL (5.6.8+) auto-sizes this by default rather than defaulting it to 0.

query_cache_size = 32M

Ok (default is 1M). We'll also need to set query_cache_type to ON.

http://www.mysqlperformanceblog.com/2007/01/19/tmp_table_size-and-max_heap_table_size/

I'm super-wary about relying on a post that old as gospel for anything from 5.6+, at least as far as the details go. But I'll dig into it.

tmp_table_size = 128M
max_heap_table_size = 128M

The current default for both tmp_table_size and max_heap_table_size is 16777216, which is 16MB (the docs give no units). We can bump these to 128M, I guess.

This is specific to MyISAM, which I wouldn't recommend, but are still needed for some purposes (specifically, full text indexes, which aren't supported in InnoDB, last I knew)

Supported in 5.6+ (ref https://dev.mysql.com/doc/refman/5.7/en/innodb-fulltext-index.html). The replication setup we're using here won't support MyISAM, for what it's worth.

This may now be the default, but using one file for each innodb table is very important

It is now the default (ref http://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_file_per_table)

Rather than doing leader "election", we mark which node is primary in Consul
and use that key with CAS to identify whether to init as a primary or set up
replication.
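
For reference, the CAS behavior with Consul's KV HTTP API: a PUT with ?cas=0 only succeeds if the key does not already exist, so exactly one node can mark itself primary. The key name here is illustrative.

# returns "true" for exactly one node; the others get "false" and configure replication instead
curl -s -X PUT -d "$(hostname)" "http://consul:8500/v1/kv/mysql-primary?cas=0"
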
@tgross commented Dec 10, 2015

Pushed a commit (1fdeef3) that handles the items I think we need to handle from that list.

@misterbisson commented:

I feel like we need a standardized way to pick it from the environment. Is that even possible?

Maybe. There's some metadata we can read that might have some useful info. If not, it's probably not a huge project to add some useful info to that metadata. Noted.

thread_cache_size ... Modern mysql (5.6.8+) can auto-configure by default

At one time, that was probably the single most important performance setting a person could configure for WP on MySQL. It's good that it's not 0 by default anymore, though I also have to admit that mysqli might no longer benefit from this.

Modern mysql...

Noted. * Strokes beard, thinks evil thoughts *

The replication setup we're using here won't support MyISAM, for what it's worth

Eh. Ok.

tmp_table_size and max_heap_table_size

After sleeping on it, I'm realizing that the defaults are probably just fine for that and my use case was not common enough to deserve adjusting those vars here.

query_cache_size = 32M
query_cache_type = ON
tmp_table_size = 128M
max_heap_table_size = 128M

Contributor:

If we keep tmp_table_size and max_heap_table_size, they should probably go in the commented-out suggestions above. Sorry I made a point of those.

# just via docker-compose

# dataDump() {
# mysqldump -h ${PRIMARY_HOST} -P 3306 --all-databases --master-data --set-gtid-purged=ON > dbdump.db

Contributor:

This may just be me, but my preferred way to do this is to stop MySQL and grab the filesystem. My reasoning for that includes:

  1. The only safe way to get a dump is to lock the tables, which is effectively the same as stopping MySQL.
  2. Dumps and imports take a very long time compared to filesystem copies.
  3. It's a recommended and fully supported solution that doesn't include the pain of the above.

Perhaps you've had experience that has you preferring alternatives, though?
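
A minimal sketch of the stop-and-copy approach (default datadir path assumed, credentials elided, and how the tarball moves between instances is left out):

# on the source instance: stop mysqld so the data files are consistent, then archive them
mysqladmin -u root shutdown
tar -czf /tmp/mysql-datadir.tar.gz -C /var/lib/mysql .

# on the new replica, before mysqld ever starts:
tar -xzf /tmp/mysql-datadir.tar.gz -C /var/lib/mysql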

Contributor Author:

Yeah, I was beginning to think that as well.

But here's a radical idea -- what about taking a "container-native" approach? Stop the primary and then docker commit the whole container and bring it up as a new replica. Maybe not possible to do properly with docker-compose but it seems like a really powerful approach.
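
A sketch of that flow with plain docker commands (container and image names are made up; as noted below, data that lives on a volume would not be captured by the commit):

docker stop mysql_primary
docker commit mysql_primary local/mysql-seed:latest
docker start mysql_primary
# bring the committed image up as a new replica
docker run -d --name mysql_replica local/mysql-seed:latest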

Contributor:

Stop the primary and then docker commit the whole container

I'd been toying with that myself. The biggest challenge was not having support for it, but that's changing. The next problem is that anything on a volume doesn't get committed, which is changing more Docker practice.

Contributor Author:

On Triton we wouldn't recommend having volumes though, right?

The biggest challenge was not having support for it, but that's changing.

Ah, I'd missed that. Too bad.

Contributor Author:

Take a look at the README now and see what you think. I've glossed over the details of "how do we move the files" for the moment but I think the overall process will work.

Contributor:

On Triton we wouldn't recommend having volumes though

Well, we wouldn't have a separate data container as a volume, and there's no specific performance advantage to declaring VOLUME /var/lib/mysql in our Dockerfile. But there's still a huge convenience advantage to doing so if we ever mount the MySQL container as a volume in another container, so the two interests are contradictory.

@tgross commented Dec 11, 2015

thread_concurrency = 6

Turns out this setting only ever did anything on Solaris 8, and Oracle deprecated it in 5.6 and removed it in 5.7. (ref https://blogs.oracle.com/supportingmysql/entry/remove_on_sight_thread_concurrency)

@misterbisson commented:

Turns out this setting only ever did anything on Solaris 8

That's embarrassing. That means I have to figure out a new explanation for the problems I was solving ten years ago, after which I'd been setting that value very carefully.

Separately, innodb_thread_concurrency is a thing, along with some others. Here, have a 2006 blog post explaining it. I know you love those.


A primary that has rotated the binlog or simply has a large binlog will be impractical to use to bootstrap replication without copying data first. In this case we're going to [copy the MySQL data directory](https://dev.mysql.com/doc/refman/5.7/en/replication-gtids-failover.html) to the new replica's file system.

In order to safely snapshot MySQL, we need to prevent new writes. In order to avoid downtime for the application, we recommend using one of the other replicas as a source for the data directory. The process that's been automated here is as follows:

Contributor:

Re #1 (comment):

Take a look at the README now and see what you think

I think you're on to something, so let me raise you one:

Rather than dual-purposing one of the replicas (which has some arguable advantages), perhaps we should consider having a third mode for our MySQL image: a backup instance. The backup instance would be a replica but would never announce itself as such; instead, it would snapshot itself hourly (or at some other interval), and possibly keep n previous snapshots.

Whether this approach includes a separate service in the Compose yaml or whether the instance can auto-detect and elect to become a backup host is uncertain.

Actually, if the backup service is auto-elected from among the many MySQL instances, rather than being a separate Compose service, maybe it's OK to have it also be a read replica (though it would appear and disappear from service regularly, which would be annoying).

Write the binlog filename to Consul whenever there's a snapshot. On health
check, if the binlog file name changes from what's in Consul then it's been
rotated and we do a new snapshot.

This commit also combines the behaviors of the primary and standby into a
single behavior, but provides an optional override via USE_STANDBY env var.
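
A rough sketch of that health-check comparison (the Consul key name and the snapshot step are placeholders):

# health handler: snapshot again only if the binlog file has rotated since the last snapshot
CURRENT=$(mysql -u root -e 'SHOW MASTER STATUS\G' | awk '/File:/ {print $2}')
LAST=$(curl -fs http://consul:8500/v1/kv/mysql-last-binlog?raw)
if [ "$CURRENT" != "$LAST" ]; then
    take_snapshot   # placeholder for the actual snapshot/upload step
    curl -fs -X PUT -d "$CURRENT" http://consul:8500/v1/kv/mysql-last-binlog
fi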

### Failover via `onChange` handler

*(Note: this is all TODO and the lock semantic won't quite work like this with the code as it is now)*

Contributor Author:

This is going to be a little tricky. There's a health heartbeat for mysql-primary already, but that API doesn't have a locking semantic. I think we'll need to use Consul's sessions API to make this work -- those can have both locks and TTLs. Nevermind, the KV API does have a locking option. Should be easy enough to make work.
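
For reference, the KV lock is built on a session: create a session with a TTL, acquire the key with it, and renew the session on each health pass so the lock doesn't expire (key name and TTL below are illustrative):

# create a session with a TTL
SESSION=$(curl -s -X PUT -d '{"TTL": "30s"}' http://consul:8500/v1/session/create | jq -r .ID)
# acquire the primary key with it; returns "true" only for the winner
curl -s -X PUT -d "$(hostname)" "http://consul:8500/v1/kv/mysql-primary?acquire=${SESSION}"
# renew the session from the health handler so the TTL doesn't lapse
curl -s -X PUT "http://consul:8500/v1/session/renew/${SESSION}"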

(First pass-thru of this.)

The key we use to mark the primary in Consul is now locked by a session
with TTL. The primary updates this TTL with each pass thru the `health`
handler. If the primary becomes unhealthy, the replica(s) will try to
get this lock in their `onChange` handler. The winner will become the
new primary and the old replicas will update their replication config
to point to it.
- `MANTA_SUBUSER`: the Manta subuser account name, if any.
- `MANTA_ROLE`: the Manta role name, if any.
- `MANTA_KEY_ID`: the MD5-format ssh key id for the Manta account/subuser (ex. `1a:b8:30:2e:57:ce:59:1d:16:f6:19:97:f2:60:2b:3d`).
- `MANTA_PRIVATE_KEY`: the private ssh key for the Manta account/subuser.

Contributor:

The actual key, or the path to the key?

Contributor Author:

Actual key, alas.

Contributor:

Something like

export MANTA_PRIVATE_KEY=`cat ~/.ssh/id_rsa`

Contributor:

Here's my _env file:

# Environment variables for MySQL service
MYSQL_USER=someuser
MYSQL_PASSWORD=someuser
MYSQL_DATABASE=someuser
MYSQL_REPL_USER=anotheruser
MYSQL_REPL_PASSWORD=anotheruser

# Environment variables for backups to Manta
MANTA_BUCKET=/myuser/stor/triton-mysql
MANTA_USER=myuser
MANTA_KEY_ID=SHA256:some_sha
MANTA_URL=https://us-east.manta.joyent.com

It doesn't have a MANTA_PRIVATE_KEY. Instead, I'm setting that in my shell environment with:

export MANTA_PRIVATE_KEY=`cat ~/.ssh/id_rsa`

With that set, and Docker properly configured (with a current version), it's just

docker-compose up -d

And then…

docker-compose scale mysql=3

to scale it up

Contributor:

@tgross we probably need to explain MANTA_PRIVATE_KEY in the README and blog post.

When the MySQLNode instance is instantiated it doesn't know if the node
is primary or not. In the on_change handler we check whether the node
is primary, but we did not first update the node from Consul, so the check
always failed and the primary would execute the on_change behaviors we
expect from the replicas. This would be harmless except that it races with
health handler behaviors during the initial snapshot.
@tgross commented Feb 10, 2016

Still some work to be done but merging to master ahead of ContainerSummit.

tgross added a commit that referenced this pull request Feb 10, 2016
Work-in-progress for triton-mysql
@tgross merged commit 524a3d4 into master Feb 10, 2016
@misterbisson deleted the working branch June 9, 2016
tgross pushed a commit that referenced this pull request Sep 15, 2016