Conversation
|
@SeanTAllen approved my changes which were part of #2003. This PR also includes Nisan's ongoing work on autoscale tests. |
7177382 to
5f76329
Compare
|
So this comes back to the problem where we are apparently unregistering more than once? |
|
@jtfmumm this seems a little messed up, there's abot 15 commits that are listed as part of this. |
|
@SeanTAllen Sometimes when running the automated test with the large partition that Invariant was being violated, but not every time. I never saw it violated when running the large partition grow cycles manually. I wouldn't say it helps explain much here, though it's possible we're periodically seeing something weird that manifests as calling dispose() on the same route more than once. |
|
@SeanTAllen There's the big commit with the major fix, my 4 independent fixes, and then a bunch of commits from @nisanharamati that might need to be squashed. |
After shrink to fit, not all routers were being correctly updated to point to the new state steps. This was not caught immediately because we weren't failing when there was no route found at a partition router, which is also fixed in this commit. Closes #2000
When growing and shrinking, we appear to be violating some Invariants about not unregistering the same producer twice when we're dealing with large partitions being migrated. This does not appear to be causing an actual problem, so I'm opening an issue and commenting out the Invariants for now.
I need more partitions for testing partition rebalancing in autoscale conditions, so I increased the partition count for the alphabet test apps from 27 to 667 by changing the "letter" from a letter to a 2-letter key and creating a partition for each pair in [aa, ab, ac,..., zz] + [!]
01f000e to
40de639
Compare
|
@jtfmumm the four basic tests (grow by 1, grow by many, shrink by 1, shrink by many) are enabled with a single cycle (and longer timeouts). |
|
@nisanharamati These autoscale tests are passing for me locally, both automated and manually. But they continue to fail on CI. I pushed a change to make the timeout 200 and it's still failing. |
|
@jtfmumm so its a timing problem based on available resources, yes? |
|
@SeanTAllen It seems like it |
- Update integration.py
- Add `start_from` option to RunnerChecker
- Add `tell()` method to `Runner` to get current STDOUT position
- Add is_alive and poll methods to Runner class
- Reorganize error classes
- Add MetricsParserError
- Refactor integration's metrics parser
- Add multicycle tests for autoscale
- Add cluster_shrinker to test dependencies and custom path
- remove metrics based validation and related parsing code
- replace it with partition based testing
- remove `time.sleep`'s requierd for metrics to accummulate
- make sender timing easier to control
- Improve test error logging
- Add a dedicated output formatting function to `integration`
- Add a logging setup function to `integration`
- Add `INFO2` log level to integration logging and use it for logging
runner command strings
- Reorganize error messages and flow in `_autoscale_tests.py`
- First thing that is printed is the actual error stack trace
- Followed optionally by the output of each of the workers
- Followed by a report of any workers that may have exited with a
non-zero return code, and the last 5 logs lines from each.
- Followed by a report of where in the autoscale sequence the test
had failed.
- Followed by a reprint of the original error message (without the
traceback and worker logs)
- Followed the STDERR output from logs created using python's
`logging` module if any were present, including the command strings
each runner used for its subprocess (e.g. the wallaroo/machida/go
commands)
Note that all of these messages only show up in the case of a test
error. Passing tests do not print any output.
- comment out most tests, leaving the basic grow by 1, grow by many,
shrink by 1, and shrink by many tests with a single cycle enabled.
- Fix error in command construction resulting in both `--control` and
`--join` addresses being used for joining workers.
[This replaces #2003 since it includes Nisan's tests]
After shrink to fit, not all routers were being
correctly updated to point to the new state steps.
This was not caught immediately because we weren't
failing when there was no route found at a partition
router, which is also fixed in this commit.
Closes #2000