BBS doesn't report new "crash" events for LRPs crashed > 5 mins ago #643

vlast3k · 2022-10-09T06:57:14Z

Summary

If an LRP has already crashed once and then crashes again more than 5 minutes later, no crash event is reported for the later crash.

Steps to Reproduce

1. Clone and push the spring-music app.
2. Kill the app every 10 minutes:
while true; do cf ssh spring-music -c "kill -9 \$(pidof java)" ; sleep 600; done
3. Check cf events and see a sequence of

audit.app.ssh-authorized   test2@test.com   index: 0
audit.app.ssh-authorized   test2@test.com   index: 0
audit.app.ssh-authorized   test2@test.com   index: 0
audit.app.ssh-authorized   test2@test.com   index: 0
audit.app.ssh-authorized   test2@test.com   index: 0

instead of

app.crash                  spring-music     index: 0, reason: CRASHED, cell_id: 4581257e-2
audit.app.ssh-authorized   test2@test.com   index: 0
app.crash                  spring-music     index: 0, reason: CRASHED, cell_id: 27ee7e73-b
audit.app.ssh-authorized   test2@test.com   index: 0
app.crash                  spring-music     index: 0, reason: CRASHED, cell_id: 4581257e-2

Diego repo

Environment Details

diego-release version or other BOSH releases you have deployed - Diego v2.66.2

Possible Causes or Fixes (optional)

The reason seems to be that after the initial crash, BBS records in the DB that the "crash_count" for this LRP is 1 (or more in case of frequent subsequent crashes). Single sporadic crashes, however, result in crash_count = 1.

Then on subsequent crashes this code in actual_lrp_db

		var newCrashCount int32
		if latestChangeTime > models.CrashResetTimeout && actualLRP.State == models.ActualLRPStateRunning {
			newCrashCount = 1
		} else {
			newCrashCount = actualLRP.CrashCount + 1
		}

will actually set newCrashCount = 1 again, and later this check in actual_lrp_event_calculator.go/generateUnclaimedInstanceEvents

	if after.CrashCount > before.CrashCount {
		events = append(events, models.NewActualLRPCrashedEvent(before, after))
	}

will not append the NewActualLRPCrashedEvent because both CrashCounts are equal.
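
To make the interaction between the two snippets concrete, here is a minimal, self-contained sketch. It is not the BBS code: the `lrp` type, `nextCrashCount`, and `crashEventEmitted` are hypothetical stand-ins, and `crashResetTimeout` is assumed to be 5 minutes based on the issue title.

```go
package main

import (
	"fmt"
	"time"
)

// crashResetTimeout stands in for models.CrashResetTimeout.
// The 5 minute value is an assumption taken from the issue title.
const crashResetTimeout = 5 * time.Minute

// lrp is a hypothetical, stripped-down stand-in for an ActualLRP record.
type lrp struct {
	CrashCount int32
	Running    bool
}

// nextCrashCount mirrors the branch quoted from actual_lrp_db:
// a running LRP whose last change is older than the timeout restarts at 1.
func nextCrashCount(before lrp, sinceLastChange time.Duration) int32 {
	if sinceLastChange > crashResetTimeout && before.Running {
		return 1
	}
	return before.CrashCount + 1
}

// crashEventEmitted mirrors the check quoted from the event calculator.
func crashEventEmitted(before lrp, afterCrashCount int32) bool {
	return afterCrashCount > before.CrashCount
}

func main() {
	// The LRP crashed once earlier and is running again.
	before := lrp{CrashCount: 1, Running: true}

	// Second crash 10 minutes later: the count resets to 1, so 1 > 1 fails.
	after := nextCrashCount(before, 10*time.Minute)
	fmt.Println(after, crashEventEmitted(before, after)) // 1 false

	// Second crash 1 minute later: the count becomes 2, so the event fires.
	after = nextCrashCount(before, time.Minute)
	fmt.Println(after, crashEventEmitted(before, after)) // 2 true
}
```

In this sketch the event is lost only when the previous crash_count is already 1 and the reset branch is taken, which matches the sporadic-crash case in the reproduction steps.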

I am not quite sure how to fix it properly, but this change in actual_lrp_lifecycle_controller/CrashActualLRP

lrps[0].CrashCount = after.CrashCount - 1;

definitely mitigates the issue.
Since we are already in CrashActualLRP, we only need to ensure that the crash event is sent. Setting the CrashCount of the before LRP to after.CrashCount - 1 seems to be enough to pass the check.
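
As a rough illustration of why this workaround passes the check, the sketch below uses hypothetical stand-ins (`actualLRP`, `shouldEmitCrashEvent`, `withWorkaround`) rather than the real BBS types. It only models the CrashCount comparison and makes no claim about how the issue was ultimately fixed.

```go
package main

import "fmt"

// actualLRP is a hypothetical stand-in that keeps only the CrashCount field.
type actualLRP struct {
	CrashCount int32
}

// shouldEmitCrashEvent mirrors the comparison quoted from the event calculator.
func shouldEmitCrashEvent(before, after actualLRP) bool {
	return after.CrashCount > before.CrashCount
}

// withWorkaround returns a copy of before adjusted as proposed in the issue:
// its CrashCount is forced to one less than the count in after.
func withWorkaround(before, after actualLRP) actualLRP {
	before.CrashCount = after.CrashCount - 1
	return before
}

func main() {
	before := actualLRP{CrashCount: 1}
	after := actualLRP{CrashCount: 1} // count was reset to 1 by the timeout branch

	fmt.Println(shouldEmitCrashEvent(before, after))                        // false
	fmt.Println(shouldEmitCrashEvent(withWorkaround(before, after), after)) // true
}
```

Because the adjusted copy is always exactly one below after.CrashCount, the comparison succeeds regardless of whether the count was reset or incremented.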

Additional Text Output, Screenshots, contextual information (optional)


mariash · 2022-11-15T14:50:39Z

Sorry, this was not communicated earlier. We just fixed this in cloudfoundry/bbs@b456522

Thank you for the detailed bug report.

vlast3k added the bug label Oct 9, 2022

mariash closed this as completed Nov 15, 2022
