2 out of 3 bbs acting as master #347

Closed
paolostivanin opened this issue Aug 23, 2017 · 12 comments

@paolostivanin
Contributor

CF v 270
Diego v 1.23.1
IaaS: AWS
2 AZs

Tonight there was a failover of the RDS DB (5m), and after that the system was largely down (no push, no delete, etc.). While investigating, we found that:

  • bbs1 had "too many open files" error (but was still acting as master)
  • bbs2 took over as master but didn't do anything
  • bbs3 was idle

I guess something went wrong on the bbs side because I don't think it's normal to have 2 masters :)
What can I do to help you debug this issue?

@cf-gitbot

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/150472188

The labels on this github issue will be updated when the story is started.

@emalm
Member

emalm commented Aug 23, 2017

Thanks for letting us know about this issue, @paolostivanin. Would you be able to provide the stdout and stderr logs from all the BBS instances in this deployment, as well as the times when the RDS instance failed over or times when you noticed or resolved these degradations of service?

Some additional questions:

  • Are the BBS instances configured to acquire their locks in Consul, Locket, or both?
  • Could you clarify what you mean when you say that BBS #1 was still acting as master and that BBS #2 took over as master but didn't do anything? Which instance or instances were handling API requests? Which instance or instances were performing convergence operations? If you happen to know, which instances held the BBS lock during which periods of time?

Thanks,
Eric, CF Diego PM

@paolostivanin
Contributor Author

@emalm yes, I do have the stderr and stdout logs for the 2 BBSes that were acting as master. Can I send them to you as they are, or do I have to "clean" them first? AFAIK these logs shouldn't contain credentials, tokens, etc., right?

The following times are in UTC.

  • 23.8. 01:33-01:34 - the RDS instance used for the CC DB failed over (took ~1 min); this happened during the maintenance window specified for this RDS instance, so it is a normal situation

  • 23.8. 01:33-01:40 - CF API (/v2/info) was not available; the RDS downtime triggered a restart of the cloud controller process (on the api nodes); it is unclear why this took so long

  • 23.8. 01:33-09:04 - Diego BBS (aka database) not available

    • 02:40 - Diego BBS came back automatically, but with 2 masters, meaning it was still not working
    • 09:04 - manually repaired Diego BBS via bosh recreate; Diego BBS is available again (1 master, 2 slaves)
  • BBS instances acquire their locks using Consul (Locket is still marked as experimental)

  • I mean that prior to the outage only one BBS was reporting to grafana but after the outage 2 BBS were reporting at the same time. CF api and any lifecycle operation (including repair of crashed app instances) were not available. Prior to the outage, database_z1 83efe73d-0b03-47eb-a74e-52b2a2bc59df was the master. After the outage (and the manual repair) database_z2 865e9a1e-0dfe-44be-89e8-52c570d08823 became the master.

@emalm
Member

emalm commented Aug 24, 2017

Thanks for the additional information, @paolostivanin, especially about the timing. I believe the BBS logs would be safe to send without any scrubbing to remove sensitive credentials or environmental details.

In the hopefully unlikely event that this issue occurs again before we identify the root cause and introduce any relevant fixes, could you also try killing the stuck BBS process with kill -QUIT $(pidof bbs), and then capture the goroutine dump that the process emits to the stderr log?

Thanks again,
Eric

@paolostivanin
Contributor Author

paolostivanin commented Aug 25, 2017

Hi,
please find attached the stderr and stdout logs for bbs_z1/0 and bbs_z2/0 and /1. In case you need them, I also have other jobs' logs.

bbs_all.tar.xz.zip

Could a broken Consul be the root cause of the 2 BBS masters? What do you think?

@crhino
Contributor

crhino commented Aug 30, 2017

@paolostivanin What type of RDS database are you using, postgres or MySQL?

@paolostivanin
Contributor Author

postgres!

@crhino
Contributor

crhino commented Aug 30, 2017

@paolostivanin In your timeline above, you have two events for the BBS: one at 02:40, where the BBS came back, and one at 09:04, when you manually repaired the BBS via BOSH recreate. At what point did CF API operations start working?

@paolostivanin
Contributor Author

@crhino CF API and all lifecycle operations (including repair of crashed app instances) were unavailable for ~7.5 h. The API started working again at ~09:04 UTC.

@crhino
Contributor

crhino commented Sep 1, 2017

@paolostivanin Is it possible for you to get the cell logs from that time period? That would help diagnose the downtime issue.

@paolostivanin
Contributor Author

paolostivanin commented Sep 4, 2017

Unfortunately no. We have a high load and hundreds of cells, so we don't keep the logs for long.

@crhino
Contributor

crhino commented Sep 5, 2017

We think we have figured out why the deployment wasn't working. Looking at the BBS logs for the node that did not shut down, you can see bbs.request.cells entries still being logged, starting soon after that BBS lost its lock. We believe this is the auctioneer attempting to fetch the CellReps from the BBS: it retries every second (as the auctioneer is configured to do on failure), and the requests started right as the new BBS took the lock and began its convergence processes. Our theory is that the auctioneer had an idle TCP connection to the old BBS hanging around, and when it received work, it reused that idle connection. Reusing an idle connection bypasses any DNS lookup, so the auctioneer never saw the DNS change pointing to the new BBS node.

The root cause is that the postgres driver does not enforce any timeouts on operations against the backing postgres database, which caused the BBS to hang during shutdown. Others have experienced this problem of hanging database connections as well. We have written a story to implement postgres timeouts.

Hope that helps, we will close this issue since we have a story to track the work.

@crhino crhino closed this as completed Sep 5, 2017
jvshahid added a commit that referenced this issue Jun 5, 2018
n4wei pushed a commit that referenced this issue Oct 9, 2019