-
Notifications
You must be signed in to change notification settings - Fork 210
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
2 out of 3 bbs acting as master #347
Comments
We have created an issue in Pivotal Tracker to manage this: https://www.pivotaltracker.com/story/show/150472188 The labels on this github issue will be updated when the story is started. |
Thanks for letting us know about this issue, @paolostivanin. Would you be able to provide the stdout and stderr logs from all the BBS instances in this deployment, as well as the times when the RDS instance failed over or times when you noticed or resolved these degradations of service? Some additional questions:
Thanks, |
@emalm yes, I do have the stderr and stdout logs for the 2 bbs that were acting as master. Can I send them to you as they are of do I have to "clean" them before? AFAIK these logs shouldn't contain credentials, tokens, etc, right? The following times are in UTC.
|
Thanks for the additional information, @paolostivanin, especially about the timing. I believe the BBS logs would be safe to send without any scrubbing to remove sensitive credentials or environmental details. In the hopefully unlikely event that this issue occurs again before we identify the root cause and introduce any relevant fixes, could you also try killing the stuck BBS process with Thanks again, |
Hi, Maybe broken Consul could be a root cause for the 2 BBS masters? What do you think? |
@paolostivanin What type of RDS database are you using, postgres or MySQL? |
postgres! |
@paolostivanin In your timeline above, you have two events for the BBS. One at 02:40 where the BBS came back and two at 09:04 when you manually repaired the BBS via BOSH recreate. At what point did CF API operations start working? |
@crhino CF api and any lifecycle operations (including repair of crashed app instances) were not available for ~7.5h. API started working again ~9:04 UTC |
@paolostivanin Is it possible for you to get the cell logs from that time period? That would help diagnose the downtime issue. |
Unfortunately no. We have a high load and hundred of cells, so we don't keep the logs too long. |
We think we have figured out why the deployment wasn't working. Looking at the BBS logs for the node that did not shut down, you can see that there are still The root cause of this is that the postgres driver does not provide any timeouts associated with a backing postgres database, causing the BBS to hang during shutdown. You can see here that others have experienced the problem of hanging database connections. We have written this story to implement postgres timeouts. Hope that helps, we will close this issue since we have a story to track the work. |
Submodule src/code.cloudfoundry.org/executor 23ff7f34..08a81a58: > use the new gomega Receive matcher to cleanup the tests Submodule src/github.com/onsi/ginkgo 67b9df7f5..c73579c58: > Increase eventually timeout to 30s > Clarify asynchronous test behaviour > Travis badge should only show master > v1.5.0 > Bring the changelog up to date (#470) > Merge pull request #466 from onsi/issue-464 > Merge pull request #468 from onsi/update-changelog > Merge pull request #459 from onsi/unfocus-exclude-vendor > Merge pull request #456 from idoru/ignore-suite-with-ignored-go-files > Merge pull request #455 from onsi/flaky-measurements > Merge pull request #423 from alamages/coverage-tests-fix > Add go 1.10 to travis > Merge pull request #448 from melnikk/master > Merge pull request #451 from onsi/go-1.10-coverage-test > Merge pull request #446 from onsi/do-not-create-outputdir > Add an extra new line after reporting spec run completion > Merge pull request #441 from blgm/releasing > Merge pull request #443 from onsi/add-go-1.10 > Merge pull request #438 from onsi/issue-436 > Merge pull request #434 from mezis/blur-specify > Merge pull request #430 from onsi/issue-426 > Add tests for When/Specify > Update comments and README.md for `When` DSL > Merge pull request #386 from akshaymankar/when > Merge pull request #383 from eloycoto/master > fix the tests on go versions <1.9 > add a test case for #415 > Fix Ginkgo stack trace on failure for Specify > Merge pull request #409 from lflux/readme-update > Merge pull request #400 from alex-slynko/patch-1 > Merge pull request #403 from alex-slynko/fix_govet > Merge pull request #404 from lflux/travis-go-1.9 > Merge pull request #399 from alex-slynko/fix_imports > Merge pull request #401 from alex-slynko/patch-2 > Merge pull request #398 from martinxsliu/generated-import-ordering > Merge pull request #390 from DennisDenuto/master > Re-add noisySkippings flag > Replace GOPATH in Environment > Allow coverage to be displayed for focused specs (#367) > ensure customer reporters are run before default reporter > Handle -outputdir flag (#364) > Handle -coverprofile flag (#355) > v1.4.0 > v1.4.0 > fix CONTRIBUTING.md > Add a CONTRIBUTING.md file > add -timeout flag. fixes #248 > add 'requireSuite' flag to cause failure if there are test files but no suite (#359) > update changelog > emit compilation output > link to vscode extension > add ginkgo watch -watchRegExp fixes #356 > remove fixCompilationOutput (fixes #357) > Remove mailing list > Add Rick and Morty quote to untilItFails messages (#354) > Revert "Fix wrong repo URL (#351)" (#352) > Fix wrong repo URL (#351) > Lack of test suites no longer breaks builds (#347) > Update CHANGELOG.md > Warn when a suite has test files but no tests to run (#345) > fix flaky test by, you know, not relying on obscure go internals > update .travis.yml to get latest patch releases > fix failing go 1.5 tests > ensure tests pass on Go 1.8.1 > v1.3.1: Actually change the CLI version > print location of test suite that failed to report back (#339) > v1.3.0 Submodule src/github.com/onsi/gomega 777f4d387..41673fd8f: > allow 'Receive' matcher to be used with concrete types (#286) > Fix data race in ghttp server (#283) > Travis badge should only show master > v1.4.0 > Update the CHANGELOG.md with all changes since last release > Improve message from AssignableToTypeOf when expected value is nil (#281) > Merge pull request #278 from benmoss/master > Merge pull request #273 from vayan/master > change Ω to Expect in most places (#268) > Merge pull request #272 from trayo/master > Merge pull request #271 from onsi/add-go-1.10 > Merge pull request #270 from Patagonicus/master > Remove tests that tried to verify signals as async > Merge pull request #262 from onsi/issue-173 > Merge pull request #259 from onsi/issue-258 > v1.3.0 > Add more robust support for XUnit style tests > Merge pull request #256 from onsi/allow-multiple-go-routines > Merge pull request #253 from challiwill/master > Merge pull request #249 from alex-slynko/fix_govet > Merge pull request #247 from alex-slynko/fix_imports > Merge pull request #243 from williammartin/find-first-mismatch-oob > Merge pull request #241 from ivy/fix-typos > Merge pull request #218 from alex-slynko/patch-1 > Merge pull request #239 from williammartin/parallel-testing > Merge pull request #235 from onsi/fix-227 > MatchXML should ignore the order of xml node attributes > Merge pull request #233 from kaedys/bugfix/error-matcher-fix > Uses equality matcher for matching error strings Signed-off-by: Sunjay Bhatia <sbhatia@pivotal.io>
Submodule src/github.com/onsi/ginkgo eea6ad008..d90e0dcda: > Add integration test > Fix coverage files combining > A new CLI option: -ginkgo.reportFile <file path> (#601) > v1.10.2 > speed up table entry generateIt() (#609) > Fix. Write errors to stderr instead of stdout (#610) > v1.10.1 > stack backtrace: fix skipping (#600) > v1.10.0 > stack backtrace: fix alignment and skipping > fix typo in documentation > v1.9.0 > Fixed typos in comments > gofmt code > Simplify code > Simplify concatenation, incrementation and function assignment > Avoid unnecessary conversions > JUnit: include more detailed information about panic > Option to print output into report, when tests have passed > Print help to stdout when the user asks for help Submodule src/github.com/onsi/gomega 41673fd8f..bdebf9e0e: > v1.7.0 > export format property variables (#347) > minor fix in the documentation of ExpectWithOffset (#358) > v1.6.0 > Remove duplication in XML matcher tests > Remove unnecessary conversions (#357) > Fixed import order (#353) > Added missing error handling in test (#355) > Simplify code (#356) > Simplify code (#354) > Fixed typos (#352) > Add failure message tests to BeElementOf matcher > Update go-testcov untested sections > mark all uncovered files so go-testcov ./... works > display special chars on error > Reenable gotip in travis > Add BeElementOf matcher > Fix the typo of comment (#345) > Optimize contain_element_matcher > v1.5.0 > add analogous tests to fields_test.go > added tests for error messages > remove redundant validity check > ensure error messages refer to the underlying interface, not the `reflect.Value` > Run CI on go 1.12 and not 1.9 > Turn off gotip on travis > Mark session_test.go with build tag (#324) > more matcher tests > try to appease linters > MatchKeys > Clarify message for unexpected errors > make it easy to run everythin that travis would do > remove go 1.6, 1.7, 1.8 from the build matrix, and add tip > fix lint errors and use type aliases to remove stuttering > remove duplication of code and add docs to public function > v1.4.3 > test yaml parser errors and better comment on why panics are there > add test for not matcher with errors > add test for or matcher with errors > add tests for positive failure messages of Panic > ensure file name and line numbers are correctly reported > Fail the build if gofmt detects any changes required > Make changes to satisfy gofmt > Move GO111MODULE into more suitable env key in Travis > Add coverage for BeZero and Succeed matchers (#307) > Fixed matcher for content-type (#305) > v1.4.2 > Refactor goexec build test > Updates for Go v1.11 official release > Add go.mod and go.sum files to define the gomega go module > Work around go vet issue with Go v1.11 (#300) > Better output when using with go XUnit-style tests, fixes #255 (#297) > Fix MatchJSON fail to parse json.RawMessage (#298) > Move funcs shared by MatchJSON and MatchYAML to a dedicated support file > Added first failure path matching for YAML (#279) > show threshold in failure message of BeNumericallyMatcher (#293) > v1.4.1 > Update documentation (#289)
CF v 270
Diego v 1.23.1
IaaS: AWS
2 AZs
Tonight there was a failover of the rds db (5m) and after that the system was kinda down (no push, no delete, etc). While investigating, we found out that:
I guess something went wrong on the bbs side because I don't think it's normal to have 2 masters :)
What can I do to help you debug this issue?
The text was updated successfully, but these errors were encountered: