Action container log collection does not wait for sentinel on developer error #4582
Conversation
…er error

* So far, log collection for managed Docker action containers waits for sentinels - even if a developer error occurs. Sentinel waiting is only disabled if an action exceeded its init or run timeout.
* This change disables sentinel waiting for all developer errors because a managed action runtime may not be able to write sentinels at all in developer error situations. With this change, missing sentinels are handled in a more robust way.
* Without sentinel waiting, the Docker action container log file is only read until the end of the file is reached. No re-reads are performed if the sentinel has not been found yet. On busy systems, this may lead to a few missing log lines in case of developer errors.
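For illustration, a minimal, self-contained sketch of the gating change described above. The types and names here are stand-ins for this sketch only; the actual code uses OpenWhisk's WhiskActivation / ActivationResponse classes.

```scala
// Illustrative stand-in types, not OpenWhisk's actual classes.
final case class ActivationOutcome(isTimedout: Boolean, isContainerError: Boolean)

object SentinelPolicy {
  // Before this change: only an exceeded init/run timeout disabled sentinel waiting.
  def waitForSentinelBefore(a: ActivationOutcome): Boolean = !a.isTimedout

  // After this change: any developer error (container error) disables sentinel waiting,
  // because the managed runtime may never get a chance to write the sentinel markers.
  def waitForSentinelAfter(a: ActivationOutcome): Boolean = !a.isContainerError

  def main(args: Array[String]): Unit = {
    val developerError = ActivationOutcome(isTimedout = false, isContainerError = true)
    println(waitForSentinelBefore(developerError)) // true  -> tail the log file and wait for sentinels
    println(waitForSentinelAfter(developerError))  // false -> read the log file once, up to end of file
  }
}
```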
Assessment
Conditions for container / developer error

A sketch of the resulting two log-reading modes follows after this list.

- case BlackboxStartupError(msg) => ActivationResponse.developerError(msg)
- Application-specific error while starting a blackbox container.
- Examples: image not found in registry or network failure while pulling.
- The action container does not even start; no log reading is performed at all.
In Container:
- openwhisk/common/scala/src/main/scala/org/apache/openwhisk/core/containerpool/Container.scala, line 140 (9bef49f): ActivationResponse.developerError(Messages.timedoutActivation(timeout, true))) - Action timeout is exceeded during init.
- Example: a Node.js action with a busy loop during init blocks the event loop.
- Expectation: managed action runtimes do not produce sentinels on timeout during init.
- openwhisk/common/scala/src/main/scala/org/apache/openwhisk/core/containerpool/Container.scala, line 177 (9bef49f): ActivationResponse.developerError(Messages.timedoutActivation(timeout, false)) - Action timeout is exceeded during run.
- Example: action runs too long.
- Expectation: managed action runtimes do not produce sentinels on timeout during run.
In ActivationResult - processInitResponseContent():
- openwhisk/common/scala/src/main/scala/org/apache/openwhisk/core/entity/ActivationResult.scala, line 151 (9bef49f): developerError(errorContent) - POST /init returns a response with status != OK (200); the response is not truncated and is valid JSON. We assume that the managed action runtime detected a problem during init and sent the response.
- Example: Node.js action code that only contains a comment => no valid entry point can be found.
- Expectation: managed action runtimes should produce sentinels in this situation.
- openwhisk/common/scala/src/main/scala/org/apache/openwhisk/core/entity/ActivationResult.scala, line 153 (9bef49f): developerError(invalidInitResponse(str)) - POST /init returns a response with status != OK (200); the response is not truncated but is not valid JSON. We assume that the managed action runtime detected a problem during init and sent a broken response due to a bug.
- Should not happen for managed runtimes because such failures should be caught in test. Blackbox containers may return improper responses. Can malicious actions cause sending of a broken response?
- Example: no known examples.
- Expectation: managed action runtimes should produce sentinels in this situation.
- openwhisk/common/scala/src/main/scala/org/apache/openwhisk/core/entity/ActivationResult.scala, line 157 (9bef49f): developerError(truncatedResponse(str, length, maxlength)) - POST /init returns a response with status != OK (200) and the response is truncated because it exceeds the size limit (1 MiB). We assume that the managed action runtime detected a problem during init and sent a broken response due to a bug.
- Should not happen for managed runtimes because such failures should be caught in test. Blackbox containers may return improper responses. Can malicious actions cause sending of a broken response?
- Example: no known examples.
- Expectation: managed action runtimes should produce sentinels in this situation.
- openwhisk/common/scala/src/main/scala/org/apache/openwhisk/core/entity/ActivationResult.scala, line 161 (9bef49f): developerError(memoryExhausted) - The TCP connection to the action container "fails" or breaks down during POST /init or while waiting for a response AND the OOM check is positive. We assume that the managed action runtime did not have a chance to react to this error situation.
- Example: action code consumes too much physical memory by including a lot of prereq packages.
- Expectation: managed action runtimes won't produce sentinels in this situation.
- openwhisk/common/scala/src/main/scala/org/apache/openwhisk/core/entity/ActivationResult.scala, line 165 (9bef49f): developerError(abnormalInitialization) - The TCP connection to the action container "fails" or breaks down during POST /init or while waiting for a response AND the OOM check is negative. We assume that the managed action runtime did not have a chance to react to this error situation.
- Example: the action code breaks the managed action runtime process by exiting.
- Expectation: managed action runtimes won't produce sentinels in this situation. Malicious actions may be able to delay / prevent sentinel production.
In ActivationResult - processRunResponseContent():
- openwhisk/common/scala/src/main/scala/org/apache/openwhisk/core/entity/ActivationResult.scala, line 197 (9bef49f): developerError(errorContent) - POST /run returns a response with status != OK (200); the response is not truncated and is valid JSON. We assume that the managed action runtime detected a problem during / after run and sent the response.
- Example: action runtime caught an exception thrown by action code.
- Expectation: managed action runtimes should produce sentinels in this situation.
- openwhisk/common/scala/src/main/scala/org/apache/openwhisk/core/entity/ActivationResult.scala, line 202 (9bef49f): developerError(invalidRunResponse(str)) - POST /run returns a response with status != OK (200); the response is not truncated and is valid JSON, but the JSON is not an object (instead: String, Number, ...). We assume that the managed action runtime detected a problem during / after run and sent a broken response due to a bug.
- Should not happen for managed runtimes because such failures should be caught in test. Blackbox containers may return improper responses. Can malicious actions cause sending of a broken response?
- Example: no known examples.
- Expectation: managed action runtimes should produce sentinels in this situation.
- openwhisk/common/scala/src/main/scala/org/apache/openwhisk/core/entity/ActivationResult.scala, line 207 (9bef49f): developerError(invalidRunResponse(str)) - POST /run returns a response with status != OK (200); the response is not truncated but is not valid JSON. We assume that the managed action runtime detected a problem during / after run and sent a broken response due to a bug.
- Should not happen for managed runtimes because such failures should be caught in test. Blackbox containers may return improper responses. Can malicious actions cause sending of a broken response?
- Example: no known examples.
- Expectation: managed action runtimes should produce sentinels in this situation.
- openwhisk/common/scala/src/main/scala/org/apache/openwhisk/core/entity/ActivationResult.scala, line 211 (9bef49f): developerError(truncatedResponse(str, length, maxlength)) - POST /run returns a response with status != OK (200) and the response is truncated because it exceeds the size limit (1 MiB). We assume that the managed action runtime detected a problem during / after run and sent a broken response due to a bug.
- Should not happen for managed runtimes because such failures should be caught in test. Blackbox containers may return improper responses. Can malicious actions cause sending of a broken response?
- Example: no known examples.
- Expectation: managed action runtimes should produce sentinels in this situation.
- openwhisk/common/scala/src/main/scala/org/apache/openwhisk/core/entity/ActivationResult.scala, line 215 (9bef49f): developerError(memoryExhausted) - The TCP connection to the action container "fails" or breaks down during POST /run or while waiting for a response AND the OOM check is positive. We assume that the managed action runtime did not have a chance to react to this error situation.
- Example: action code consumes too much physical memory by creating too many objects.
- Expectation: managed action runtimes won't produce sentinels in this situation.
- openwhisk/common/scala/src/main/scala/org/apache/openwhisk/core/entity/ActivationResult.scala, line 219 (9bef49f): developerError(abnormalRun) - The TCP connection to the action container "fails" or breaks down during POST /run or while waiting for a response AND the OOM check is negative. We assume that the managed action runtime did not have a chance to react to this error situation.
- Example: the action code breaks the managed action runtime process by exiting.
- Expectation: managed action runtimes won't produce sentinels in this situation. Malicious actions may be able to delay / prevent sentinel production.
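For illustration, a minimal, self-contained sketch of the two log-reading modes these expectations lead to: wait for sentinels versus read once to end of file. The sentinel string and helper names are assumptions for this sketch only; the real implementation tails the container log file with Akka Streams.

```scala
object LogReadModes {
  // Assumed sentinel marker appended by managed runtimes to stdout and stderr after each activation.
  val Sentinel = "XXX_THE_END_OF_A_WHISK_ACTIVATION_XXX"

  // Mode 1 (waitForSentinel = true): keep lines until both sentinel markers have been seen.
  // A real implementation re-reads (tails) the file if the sentinels have not shown up yet.
  def readUntilSentinels(lines: Iterator[String]): Vector[String] = {
    var sentinelsSeen = 0
    val out = Vector.newBuilder[String]
    while (lines.hasNext && sentinelsSeen < 2) {
      val line = lines.next()
      if (line.contains(Sentinel)) sentinelsSeen += 1 else out += line
    }
    out.result()
  }

  // Mode 2 (waitForSentinel = false, used on developer error): read whatever is in the file
  // right now, once, without waiting. A few trailing lines may be missing on busy systems.
  def readToEndOfFile(lines: Iterator[String]): Vector[String] =
    lines.filterNot(_.contains(Sentinel)).toVector

  def main(args: Array[String]): Unit = {
    val completeLog = Vector("hello from stdout", Sentinel, Sentinel)
    println(readUntilSentinels(completeLog.iterator)) // Vector(hello from stdout)
    println(readToEndOfFile(completeLog.iterator))    // Vector(hello from stdout)
  }
}
```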
markusthoemmes
left a comment
Good find and solid fix. Thanks! 🎉 A few suggestions throughout.
| * There are two possible modes controlled by parameter waitForSentinel:
| *
| * Only parses and returns as much logs as fit in the passed log limit.
| * 1. Wait for sentinel: tail container log file until two sentinel markers show up. Complete
Please break the description to the next line. Indent by one or two spaces if you feel the necessity there.
Makes sense. Re-formatting as suggested.
| logs.concat(
| Source.single(ByteString(LogLine(Instant.now.toString, "stderr", Messages.logFailure).toJson.compactPrint)))
| Source.single(
| ByteString(LogLine(Instant.now.toString, "stderr", Messages.logWarningDeveloperError).toJson.compactPrint)))
I'm assuming the log will end with both a logging error and this error, is that correct? If so: maybe we should drop the TimeoutException from the recover below and add that to the stream here. We can then decide if we want to append the general log failure or this new one.
I'm assuming the log will end with both a logging error and this error, is that correct?
I think no, because:
- If we have a developer error, we won't tail the log file but only read to the file's current end. In addition, we do not require finding 2 sentinel markers.
- With these conditions, the likelihood to run into idle / completion timeouts should be low.
- All other errors (framing, log limit exceeded) could possibly occur - but there is no correlation to the developer error.
My understanding is that you are thinking about breaking ties when we have two different error conditions like developer error AND timeout / framing error / log limit exceeded?
From my point of view, there are three perspectives on that:
- As action developer, I use the action logs to understand and debug problems with init'ing and running actions. For the action developer, it probably doesn't matter which error appears in the last log line or whether the two last log lines mention more specific error conditions ("There was an issue ..." and "The action did not initialize or run as expected ...") - as long as there is a hint of what happened and that the log is probably truncated.
- From a container proxy perspective, I want to remove (and not re-use) an action container that had a developer or log collecting and processing error. From that perspective, it doesn't matter which of these two error conditions arrives in the container proxy.
- As operator, I do not want to see indications (system log messages, metrics) for failed log collections if there is a developer error but log collecting and processing works properly. Only if a timeout / framing error / log limit exceeded occurs do I want a log collecting failure. That's why I excluded logWarningDeveloperError as a possible error message.
The current variant would not "see" any of the current error messages "There was an issue while collecting your logs. Data might be missing." and "Logs were truncated because the total bytes size exceeds the limit of ${limit.toBytes} bytes." in case of a developer error because the last message would always be "The action did not initialize or run as expected. Log data might be missing.". From that perspective, it would be good to append the developer error message after checking the last log line for log collecting / processing errors.
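As a rough sketch of that trade-off (the helper below is hypothetical, not the PR's stream code; the message texts are the ones quoted above):

```scala
object TrailerVariants {
  val logFailure     = "There was an issue while collecting your logs. Data might be missing."
  val developerError = "The action did not initialize or run as expected. Log data might be missing."

  // Variant that hides log collecting errors: always end with the developer-error line only.
  def developerErrorOnly(logs: Vector[String]): Vector[String] =
    logs :+ developerError

  // Preferred direction: first append the collecting-error line if one occurred,
  // then the developer-error line, so both conditions stay visible in the log.
  def both(logs: Vector[String], hadCollectingError: Boolean): Vector[String] =
    (if (hadCollectingError) logs :+ logFailure else logs) :+ developerError
}
```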
Thanks for the exhaustive writeup. My only concern was having multiple error messages in the log file. The missing piece in my head was that we don't tail in the non-sentinel case (I'm hitting my past self for this, I should have introduced a separate path for that which doesn't have the idleTimeout for instance 😂 )
@markusthoemmes I pushed a new commit that should properly deal with log collecting errors when there is a developer error. Please have a look. It's no "nice" code though...
| // Define a time limit for collecting and processing the logs of a single activation.
| // If this time limit is exceeded, log processing is stopped and declared unsuccessful.
| // Allow 2 seconds per MB log limit, with a minimum of 10 seconds.
| val logProcessingTimeout = limit.toMB.toInt.max(5) * 2.seconds
This is somewhat arbitrary and might need some tweaking. Can you run this code through a loadtest to determine we don't hit this overall timeout under load too often?
Good point. I checked production histogram metrics for invoker.collect_logs_finish and the max value for around 20 million action container log reads is below 1.2 seconds. So I conclude that 20 seconds for a 10 MiB log limit should be a reasonable setting.
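For reference, a worked computation of the formula quoted above (only the formula is taken from the snippet; the surrounding configuration is omitted):

```scala
import scala.concurrent.duration._

object LogProcessingTimeout extends App {
  // Same formula as in the quoted code: at least 10 seconds, otherwise 2 seconds per MB of log limit.
  def logProcessingTimeout(logLimitMb: Int): FiniteDuration = logLimitMb.max(5) * 2.seconds

  println(logProcessingTimeout(1))  // 10 seconds - minimum applies for limits up to 5 MB
  println(logProcessingTimeout(10)) // 20 seconds - matches the 10 MiB example above
}
```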
Per @rabbah's comment below, I'm fine with rather going longer than too short on this for now. It's only a safeguard anyway.
nice @sven-lange-last thanks for checking.
@markusthoemmes @rabbah I explicitly want to check with you on the total timeout setting. Are you ok with the setting above (2s / MiB log limit, minimum 10s) - given that this is "just" a safeguard to address other currently unknown problems? As discussed, the developer error is mainly addressed by changing from sentinel-based tailing to non-sentinel reading to end-of-file without tailing.
Yes I am fine with that.
OK with me, though I would also have been OK with a lower hard cap. I think this is a dilation factor we can move to the configuration parameters.
| // between two loglines appear without relying on the log frequency in the action itself.
| .idleTimeout(waitForLogs)
| // Apply an overall time limit for this log collecting and processing stream.
| .completionTimeout(logProcessingTimeout)
It looks to me like this is the crux of the patch - time limit the log extraction until sentinel is found. I don’t know what a reasonable rate should be but up to 2s/1MB strikes me as long. I like Markus’ suggestion and wonder what the drain rate might be under load. That’s especially relevant when waiting is on the critical path.
It's not really the crux but an additional safeguard as I read it. It's a "catchall".
The crux here is to ignore sentinels (and thus don't tail logs, which guarantees a finish of the stream) in error cases.
See my comment above on why I thought/think this is the essence of the patch. Arguably this limit should have already existed when waiting for logs to drain. But I understand the combined effect as well now, thanks to the excellent comments.
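A minimal Akka Streams illustration of how the two timeouts in the quoted snippet interact (a standalone toy stream, not the OpenWhisk log source): idleTimeout guards against stalls between elements, while completionTimeout caps the whole collection as a catch-all.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.duration._

object LogStreamTimeouts extends App {
  implicit val system: ActorSystem = ActorSystem("log-timeouts")
  import system.dispatcher

  Source(List("log line 1", "log line 2", "log line 3"))
    // Fail the stream if the gap between two elements exceeds this duration (stall guard).
    .idleTimeout(2.seconds)
    // Fail the stream if the whole collection takes longer than this overall cap (catch-all).
    .completionTimeout(10.seconds)
    .runWith(Sink.foreach(println))
    .onComplete(_ => system.terminate())
}
```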
* Improve error handling such that developer errors are not treated as log collecting errors. Properly detect log collecting errors independently of developer errors.
* Add unit test coverage for different error situations.
rabbah
left a comment
LGTM. I defer to @markusthoemmes on readiness to merge.
| val isTimedoutActivation = activation.isTimedoutActivation
| val logs = logStream(transid, container, action, isTimedoutActivation)
| val logLimit = action.limits.logs
| val isDeveloperError = activation.response.isContainerError // container error means developer error
so the change here is that any time the tcp connection is severed, the invoker will not wait for logs?
Yes, that's true - it will only read until the end of the log file and not tail until a sentinel log marker is found. And also for more conditions - see here: #4582 (comment).
| * in a managed action runtime and prevent the production of sentinels. In that case, log file
| * reading may continue endlessly.
| *
| * For these reasons, do not wait for sentinels to appear in log output when activations end up
OK this answers my question above - thanks.
If the log reading is always capped (don't wait more than X seconds), is this distinction really necessary? I think it's a heuristic based on the intuition implied by the comment (container aborted, it's unlikely a sentinel will exist).
My goal was to quickly complete log collecting if a developer error occurred. That's accomplished by not tailing the log file on developer error.
The total log collecting timeout was "just" meant as a safeguard for further, currently unknown flaws.
If we used a total log collecting timeout also for developer errors, we would give action containers some additional seconds of CPU time.
| protected[core] def messageForCode(code: Int) = {
| require(code >= 0 && code <= 3)
| require(code >= Success && code <= WhiskError)
we could probably turn these into case objects... in the future.
I agree - that also came to my mind... But I was already happy to replace numbers with vals.
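One possible shape for the "case objects ... in the future" idea - purely illustrative, not part of this PR; the names and messages below are assumptions:

```scala
// Sealed hierarchy instead of Int constants for the activation status codes.
sealed abstract class ActivationStatus(val code: Int, val message: String)
case object Success          extends ActivationStatus(0, "success")
case object ApplicationError extends ActivationStatus(1, "application error")
case object DeveloperError   extends ActivationStatus(2, "action developer error")
case object WhiskError       extends ActivationStatus(3, "whisk internal error")

object ActivationStatus {
  val values: Seq[ActivationStatus] = Seq(Success, ApplicationError, DeveloperError, WhiskError)

  // Exhaustive by construction; no require(code >= Success && code <= WhiskError) needed.
  def messageForCode(code: Int): Option[String] = values.find(_.code == code).map(_.message)
}
```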
| * @param isDeveloperError did activation fail due to developer error?
| * @return true if log collecting failed, false otherwise
| */
| protected def isLogCollectingError(actLogs: ActivationLogs,
The logic here is kinda what I was fearing earlier 😂 . Should we at least leave a TODO to return errors from the log gathering itself and append the appropriate lines based on these after the fact?
Well... I feel ashamed about this piece of code... at the same time, I had the feeling that I would have to improve other areas as well.
For example: the log reading pipeline recovers exceptions with regular stream elements, i.e. log messages that we check for later on. If an action just logged one of these messages as its last log line, we would detect this as a log collecting error...
I tried a lot of different approaches and one of the problems was that the DockerToActivationFileLogStore has a fan-out to different sinks so that the "injection" of error messages needs to be performed at a reasonable stage of the pipeline. That probably requires a custom flow or sink.
I guess it's a good idea to leave a TODO in the comments.
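To make the TODO concrete, here is a hedged sketch of the direction mentioned above: let log collection return a typed error alongside the collected lines instead of encoding errors as ordinary log lines that have to be pattern-matched afterwards. All type and helper names are illustrative.

```scala
object TypedLogCollection {
  sealed trait LogCollectingError
  case object LogReadTimedOut  extends LogCollectingError
  case object LogLimitExceeded extends LogCollectingError
  final case class FramingError(msg: String) extends LogCollectingError

  // Collected lines plus an optional, explicit error - no error strings hidden inside the log data.
  final case class CollectedLogs(lines: Vector[String], error: Option[LogCollectingError])

  // Append warning lines based on explicit flags rather than by inspecting the last log line.
  def withTrailer(logs: CollectedLogs, isDeveloperError: Boolean): Vector[String] = {
    val collectingWarning =
      logs.error.map(_ => "There was an issue while collecting your logs. Data might be missing.")
    val developerWarning =
      if (isDeveloperError) Some("The action did not initialize or run as expected. Log data might be missing.")
      else None
    logs.lines ++ collectingWarning ++ developerWarning
  }
}
```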
Codecov Report
@@ Coverage Diff @@
## master #4582 +/- ##
==========================================
- Coverage 84.52% 78.81% -5.72%
==========================================
Files 183 183
Lines 8258 8269 +11
Branches 552 566 +14
==========================================
- Hits 6980 6517 -463
- Misses 1278 1752 +474
Continue to review full report at Codecov.
@markusthoemmes @rabbah I think I addressed your review comments. Please check and let me know. @rabbah concerning your comment / request:
I didn't fully expose this factor as a configuration parameter yet because we were fine so far with also having other timeouts defined directly in code (see lines 176 to 178 in b5b88ec).
Yes, as I wrote, looked good to me already ;)
@markusthoemmes please let me know whether anything is still required from your side. @style95 please check whether you are ok. I would like to get this PR merged if everybody is ok...
style95
left a comment
@sven-lange-last Thank you for giving me time to look into this.
It looks good to me and would be fine with our production system as well.
markusthoemmes
left a comment
LGTM
…er error (apache#4582)

* Use value definitions instead of literal values
* Action container log collection does not wait for sentinel on developer error
* So far, log collection for managed Docker action containers waits for sentinels - even if a developer error occurs. Sentinel waiting is only disabled if an action exceeded its init or run timeout.
* This change disables sentinel waiting for all developer errors because a managed action runtime may not be able to write sentinels at all in developer error situations. With this change, missing sentinels are handled in a more robust way.
* Without sentinel waiting, the Docker action container log file is only read until the end of the file is reached. No re-reads are performed if the sentinel has not been found yet. On busy systems, this may lead to a few missing log lines in case of developer errors.
* Add review feedback: reformat block comments
* Fix ActionLimitsTests failure
* Address review feedback
* Improve error handling such that developer errors are not treated as log collecting errors. Properly detect log collecting errors independently of developer errors.
* Add unit test coverage for different error situations.
* Address review feedback: add comments and centralize log collecting config
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## master #4582 +/- ##
==========================================
- Coverage 84.52% 78.81% -5.72%
==========================================
Files 183 183
Lines 8258 8269 +11
Branches 552 566 +14
==========================================
- Hits 6980 6517 -463
- Misses 1278 1752 +474
☔ View full report in Codecov by Sentry.