Skip to content

Comments

Correctly update routers after shrink#2018

Merged
jtfmumm merged 8 commits intomasterfrom
issue-1929
Feb 13, 2018
Merged

Correctly update routers after shrink#2018
jtfmumm merged 8 commits intomasterfrom
issue-1929

Conversation

@jtfmumm
Copy link
Contributor

@jtfmumm jtfmumm commented Feb 1, 2018

[This replaces #2003 since it includes Nisan's tests]

After shrink to fit, not all routers were being
correctly updated to point to the new state steps.
This was not caught immediately because we weren't
failing when there was no route found at a partition
router, which is also fixed in this commit.

Closes #2000

@jtfmumm
Copy link
Contributor Author

jtfmumm commented Feb 1, 2018

@SeanTAllen approved my changes which were part of #2003. This PR also includes Nisan's ongoing work on autoscale tests.

@nisanharamati nisanharamati force-pushed the issue-1929 branch 2 times, most recently from 7177382 to 5f76329 Compare February 7, 2018 23:54
@SeanTAllen
Copy link
Contributor

So this comes back to the problem where we are apparently unregistering more than once?

@SeanTAllen
Copy link
Contributor

@jtfmumm this seems a little messed up, there's abot 15 commits that are listed as part of this.

@jtfmumm
Copy link
Contributor Author

jtfmumm commented Feb 8, 2018

@SeanTAllen Sometimes when running the automated test with the large partition that Invariant was being violated, but not every time. I never saw it violated when running the large partition grow cycles manually. I wouldn't say it helps explain much here, though it's possible we're periodically seeing something weird that manifests as calling dispose() on the same route more than once.

@jtfmumm
Copy link
Contributor Author

jtfmumm commented Feb 8, 2018

@SeanTAllen There's the big commit with the major fix, my 4 independent fixes, and then a bunch of commits from @nisanharamati that might need to be squashed.

jtfmumm and others added 7 commits February 8, 2018 13:08
After shrink to fit, not all routers were being
correctly updated to point to the new state steps.
This was not caught immediately because we weren't
failing when there was no route found at a partition
router, which is also fixed in this commit.

Closes #2000
When growing and shrinking, we appear to be violating
some Invariants about not unregistering the same producer
twice when we're dealing with large partitions being migrated.
This does not appear to be causing an actual problem, so I'm
opening an issue and commenting out the Invariants for now.
I need more partitions for testing partition rebalancing in autoscale
conditions, so I increased the partition count for the alphabet test
apps from 27 to 667 by changing the "letter" from a letter to a 2-letter
key and creating a partition for each pair in [aa, ab, ac,..., zz] + [!]
@nisanharamati nisanharamati force-pushed the issue-1929 branch 3 times, most recently from 01f000e to 40de639 Compare February 8, 2018 23:32
@nisanharamati
Copy link
Contributor

@jtfmumm the four basic tests (grow by 1, grow by many, shrink by 1, shrink by many) are enabled with a single cycle (and longer timeouts).
They all pass locally for me, so I want to see how they do in CI.
The rest of the tests are disabled right now. This should be good to merge if CI passes and the changes you made pass review (I can review them tomorrow, or maybe @SeanTAllen can)

@jtfmumm
Copy link
Contributor Author

jtfmumm commented Feb 9, 2018

@nisanharamati These autoscale tests are passing for me locally, both automated and manually. But they continue to fail on CI. I pushed a change to make the timeout 200 and it's still failing.

@SeanTAllen
Copy link
Contributor

@jtfmumm so its a timing problem based on available resources, yes?

@jtfmumm
Copy link
Contributor Author

jtfmumm commented Feb 9, 2018

@SeanTAllen It seems like it

- Update integration.py
  - Add `start_from` option to RunnerChecker
  - Add `tell()` method to `Runner` to get current STDOUT position
  - Add is_alive and poll methods to Runner class
  - Reorganize error classes
  - Add MetricsParserError
  - Refactor integration's metrics parser
- Add multicycle tests for autoscale
- Add cluster_shrinker to test dependencies and custom path
- remove metrics based validation and related parsing code
- replace it with partition based testing
- remove `time.sleep`'s requierd for metrics to accummulate
- make sender timing easier to control
- Improve test error logging
  - Add a dedicated output formatting function to `integration`
  - Add a logging setup function to `integration`
  - Add `INFO2` log level to integration logging and use it for logging
runner command strings
  - Reorganize error messages and flow in `_autoscale_tests.py`
    - First thing that is printed is the actual error stack trace
    - Followed optionally by the output of each of the workers
    - Followed by a report of any workers that may have exited with a
      non-zero return code, and the last 5 logs lines from each.
    - Followed by a report of where in the autoscale sequence the test
      had failed.
    - Followed by a reprint of the original error message (without the
      traceback and worker logs)
    - Followed the STDERR output from logs created using python's
      `logging` module if any were present, including the command strings
      each runner used for its subprocess (e.g. the wallaroo/machida/go
      commands)
    Note that all of these messages only show up in the case of a test
    error. Passing tests do not print any output.
- comment out most tests, leaving the basic grow by 1, grow by many,
  shrink by 1, and shrink by many tests with a single cycle enabled.
- Fix error in command construction resulting in both `--control` and
  `--join` addresses being used for joining workers.
@jtfmumm jtfmumm merged commit b66d0c1 into master Feb 13, 2018
wallaroolabs-wally added a commit that referenced this pull request Feb 13, 2018
@jtfmumm jtfmumm deleted the issue-1929 branch February 23, 2018 09:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Lost messages after shrink to fit

3 participants