BBS doesn't report new "crash" events for LRPs crashed > 5 mins ago #643

Closed
vlast3k opened this issue Oct 9, 2022 · 1 comment

vlast3k commented Oct 9, 2022

Summary

If an LRP has already crashed once and then crashes again more than 5 minutes later, no crash event is reported for the later crash.

Steps to Reproduce

  • clone and push spring-music app
  • while true; do cf ssh spring-music -c "kill -9 \$(pidof java)" ; sleep 600; done
  • check cf events and see a sequence of
audit.app.ssh-authorized   test2@test.com   index: 0
audit.app.ssh-authorized   test2@test.com   index: 0
audit.app.ssh-authorized   test2@test.com   index: 0
audit.app.ssh-authorized   test2@test.com   index: 0
audit.app.ssh-authorized   test2@test.com   index: 0

instead of

app.crash                  spring-music     index: 0, reason: CRASHED, cell_id: 4581257e-2
audit.app.ssh-authorized   test2@test.com   index: 0
app.crash                  spring-music     index: 0, reason: CRASHED, cell_id: 27ee7e73-b
audit.app.ssh-authorized   test2@test.com   index: 0
app.crash                  spring-music     index: 0, reason: CRASHED, cell_id: 4581257e-2

Diego repo

Environment Details

  • diego-release version or other BOSH releases you have deployed - Diego v2.66.2

Possible Causes or Fixes (optional)

The reason seems to be that, after the initial crash, BBS records in the DB that the "crash_count" for this LRP is 1 (or more, in case of frequent subsequent crashes). Single sporadic crashes, however, result in crash_count = 1.

Then, on subsequent crashes, this code in actual_lrp_db

		var newCrashCount int32
		if latestChangeTime > models.CrashResetTimeout && actualLRP.State == models.ActualLRPStateRunning {
			newCrashCount = 1
		} else {
			newCrashCount = actualLRP.CrashCount + 1
		}

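sets newCrashCount to 1 again, because more than models.CrashResetTimeout has elapsed since the last change and the LRP was in the running state. Later, in actual_lrp_event_calculator.go/generateUnclaimedInstanceEvents, the check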

	if after.CrashCount > before.CrashCount {
		events = append(events, models.NewActualLRPCrashedEvent(before, after))
	}

will not append the NewActualLRPCrashedEvent, because both crash counts are equal (1 before the crash and 1 after it).

I am not sure what the proper fix is, but this change in actual_lrp_lifecycle_controller/CrashActualLRP

lrps[0].CrashCount = after.CrashCount - 1

definitely mitigates the issue.
Since we are already in CrashActualLRP at that point, we only need to ensure that the crash event is sent. Setting the CrashCount of the "before" LRP to after.CrashCount - 1 seems to be enough to pass the check.

Additional Text Output, Screenshots, contextual information (optional)

vlast3k added the bug label Oct 9, 2022

mariash (Member) commented Nov 15, 2022

Sorry, this was not communicated earlier. We just fixed this in cloudfoundry/bbs@b456522

Thank you for the detailed bug report.

@mariash mariash closed this as completed Nov 15, 2022