Correctly update routers after shrink by jtfmumm · Pull Request #2018 · WallarooLabs/wally

jtfmumm · 2018-02-01T09:53:38Z

[This replaces #2003 since it includes Nisan's tests]

After shrink to fit, not all routers were being
correctly updated to point to the new state steps.
This was not caught immediately because we weren't
failing when there was no route found at a partition
router, which is also fixed in this commit.

Closes #2000

jtfmumm · 2018-02-01T09:55:49Z

@SeanTAllen approved my changes which were part of #2003. This PR also includes Nisan's ongoing work on autoscale tests.

SeanTAllen · 2018-02-08T14:43:19Z

So this comes back to the problem where we are apparently unregistering more than once?

SeanTAllen · 2018-02-08T14:43:53Z

@jtfmumm this seems a little messed up, there's abot 15 commits that are listed as part of this.

jtfmumm · 2018-02-08T16:26:31Z

@SeanTAllen Sometimes when running the automated test with the large partition that Invariant was being violated, but not every time. I never saw it violated when running the large partition grow cycles manually. I wouldn't say it helps explain much here, though it's possible we're periodically seeing something weird that manifests as calling dispose() on the same route more than once.

jtfmumm · 2018-02-08T16:27:42Z

@SeanTAllen There's the big commit with the major fix, my 4 independent fixes, and then a bunch of commits from @nisanharamati that might need to be squashed.

After shrink to fit, not all routers were being correctly updated to point to the new state steps. This was not caught immediately because we weren't failing when there was no route found at a partition router, which is also fixed in this commit. Closes #2000

When growing and shrinking, we appear to be violating some Invariants about not unregistering the same producer twice when we're dealing with large partitions being migrated. This does not appear to be causing an actual problem, so I'm opening an issue and commenting out the Invariants for now.

I need more partitions for testing partition rebalancing in autoscale conditions, so I increased the partition count for the alphabet test apps from 27 to 667 by changing the "letter" from a letter to a 2-letter key and creating a partition for each pair in [aa, ab, ac,..., zz] + [!]

nisanharamati · 2018-02-08T23:35:07Z

@jtfmumm the four basic tests (grow by 1, grow by many, shrink by 1, shrink by many) are enabled with a single cycle (and longer timeouts).
They all pass locally for me, so I want to see how they do in CI.
The rest of the tests are disabled right now. This should be good to merge if CI passes and the changes you made pass review (I can review them tomorrow, or maybe @SeanTAllen can)

jtfmumm · 2018-02-09T14:16:31Z

@nisanharamati These autoscale tests are passing for me locally, both automated and manually. But they continue to fail on CI. I pushed a change to make the timeout 200 and it's still failing.

SeanTAllen · 2018-02-09T14:24:47Z

@jtfmumm so its a timing problem based on available resources, yes?

jtfmumm · 2018-02-09T16:08:05Z

@SeanTAllen It seems like it

- Update integration.py - Add `start_from` option to RunnerChecker - Add `tell()` method to `Runner` to get current STDOUT position - Add is_alive and poll methods to Runner class - Reorganize error classes - Add MetricsParserError - Refactor integration's metrics parser - Add multicycle tests for autoscale - Add cluster_shrinker to test dependencies and custom path - remove metrics based validation and related parsing code - replace it with partition based testing - remove `time.sleep`'s requierd for metrics to accummulate - make sender timing easier to control - Improve test error logging - Add a dedicated output formatting function to `integration` - Add a logging setup function to `integration` - Add `INFO2` log level to integration logging and use it for logging runner command strings - Reorganize error messages and flow in `_autoscale_tests.py` - First thing that is printed is the actual error stack trace - Followed optionally by the output of each of the workers - Followed by a report of any workers that may have exited with a non-zero return code, and the last 5 logs lines from each. - Followed by a report of where in the autoscale sequence the test had failed. - Followed by a reprint of the original error message (without the traceback and worker logs) - Followed the STDERR output from logs created using python's `logging` module if any were present, including the command strings each runner used for its subprocess (e.g. the wallaroo/machida/go commands) Note that all of these messages only show up in the case of a test error. Passing tests do not print any output. - comment out most tests, leaving the basic grow by 1, grow by many, shrink by 1, and shrink by many tests with a single cycle enabled. - Fix error in command construction resulting in both `--control` and `--join` addresses being used for joining workers.

jtfmumm added DO NOT MERGE changelog - fixed labels Feb 1, 2018

jtfmumm mentioned this pull request Feb 1, 2018

Correctly update routers after shrink #2003

Closed

jtfmumm requested review from SeanTAllen and nisanharamati and removed request for SeanTAllen February 1, 2018 09:54

nisanharamati force-pushed the issue-1929 branch 2 times, most recently from 7177382 to 5f76329 Compare February 7, 2018 23:54

jtfmumm force-pushed the issue-1929 branch from ffc4d8a to 8a2d037 Compare February 8, 2018 16:23

jtfmumm and others added 7 commits February 8, 2018 13:08

Enable query's unit tests and fix json key quoting

b5211c0

Add test for multi-join, multi-cycle rebalancing

47ff24d

Fix calculation of total current workers during migration

de361a1

Fix calculation of total steps to send during migration

0954ffe

nisanharamati force-pushed the issue-1929 branch 3 times, most recently from 01f000e to 40de639 Compare February 8, 2018 23:32

jtfmumm removed the DO NOT MERGE label Feb 13, 2018

jtfmumm force-pushed the issue-1929 branch from a12bccf to dca4953 Compare February 13, 2018 17:05

jtfmumm merged commit b66d0c1 into master Feb 13, 2018

wallaroolabs-wally added a commit that referenced this pull request Feb 13, 2018

Update CHANGELOG for PR #2018 [skip ci]

003aa8e

jtfmumm deleted the issue-1929 branch February 23, 2018 09:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

Correctly update routers after shrink#2018

Correctly update routers after shrink#2018
jtfmumm merged 8 commits intomasterfrom
issue-1929

jtfmumm commented Feb 1, 2018

Uh oh!

jtfmumm commented Feb 1, 2018

Uh oh!

SeanTAllen commented Feb 8, 2018

Uh oh!

SeanTAllen commented Feb 8, 2018

Uh oh!

jtfmumm commented Feb 8, 2018

Uh oh!

jtfmumm commented Feb 8, 2018

Uh oh!

nisanharamati commented Feb 8, 2018

Uh oh!

jtfmumm commented Feb 9, 2018

Uh oh!

SeanTAllen commented Feb 9, 2018

Uh oh!

jtfmumm commented Feb 9, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Comments

Conversation

jtfmumm commented Feb 1, 2018

Uh oh!

jtfmumm commented Feb 1, 2018

Uh oh!

SeanTAllen commented Feb 8, 2018

Uh oh!

SeanTAllen commented Feb 8, 2018

Uh oh!

jtfmumm commented Feb 8, 2018

Uh oh!

jtfmumm commented Feb 8, 2018

Uh oh!

nisanharamati commented Feb 8, 2018

Uh oh!

jtfmumm commented Feb 9, 2018

Uh oh!

SeanTAllen commented Feb 9, 2018

Uh oh!

jtfmumm commented Feb 9, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants