
Conversation

@sven-lange-last
Member

  • So far, log collection for managed Docker action containers waits for sentinels even if a developer error occurs. Sentinel waiting is only disabled when an action exceeds its init or run timeout.
  • This change disables sentinel waiting for all developer errors because a managed action runtime may not be able to write sentinels at all in developer error situations. With this change, missing sentinels are handled in a more robust way (a minimal sketch of the decision follows below).
  • Without sentinel waiting, the Docker action container log file is only read until the end of the file is reached. No re-reads are performed if the sentinel has not been found yet. On busy systems, this may lead to a few missing log lines in case of developer errors.
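
For illustration, a minimal Scala sketch of that decision (hypothetical names, not the actual OpenWhisk code):

// Hypothetical sketch: decide whether log collection tails the log file until
// sentinel markers appear, or just reads to the current end of file.
object SentinelWaitDecision {
  // Sentinel waiting is skipped when the activation timed out or ended in a
  // developer (container) error, because the runtime may never write sentinels.
  def waitForSentinel(sentinelledLogs: Boolean, isTimedOut: Boolean, isDeveloperError: Boolean): Boolean =
    sentinelledLogs && !isTimedOut && !isDeveloperError

  def main(args: Array[String]): Unit = {
    println(waitForSentinel(sentinelledLogs = true, isTimedOut = false, isDeveloperError = false)) // true: tail until sentinels
    println(waitForSentinel(sentinelledLogs = true, isTimedOut = false, isDeveloperError = true))  // false: read to end of file once
  }
}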

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@sven-lange-last
Member Author

Assessment

  • The following assessment lists all situations where a container / developer error may occur.
  • This PR changes log collecting and processing for container / developer errors.
  • While this PR makes log collecting and processing more robust when container / developer errors are detected, it has the downside that a few log lines may be lost. (A summary sketch of the sentinel expectations follows after the lists below.)

Conditions for container / developer error

In ContainerProxy:

In Container:

In ActivationResult - processInitResponseContent():

    • POST /init returns a response with status != OK (200), the response is not truncated and is valid JSON. We assume that the managed action runtime detected a problem during init and sent the response.
    • Example: Node.js action code that only contains a comment => no valid entry point can be found.
    • Expectation: managed action runtimes should produce sentinels in this situation.
    • POST /init returns a response with status != OK (200), the response is not truncated but is not valid JSON. We assume that the managed action runtime detected a problem during init and sent a broken response due to a bug.
    • Should not happen for managed runtimes because such failures should be caught in tests. Blackbox containers may return improper responses. Can malicious actions cause sending of a broken response?
    • Example: no known examples.
    • Expectation: managed action runtimes should produce sentinels in this situation.
  • developerError(truncatedResponse(str, length, maxlength))

    • POST /init returns a response with status != OK (200) and the response is truncated because it exceeds the size limit (1 MiB). We assume that the managed action runtime detected a problem during init and sent a broken response due to a bug.
    • Should not happen for managed runtimes because such failures should be caught in tests. Blackbox containers may return improper responses. Can malicious actions cause sending of a broken response?
    • Example: no known examples.
    • Expectation: managed action runtimes should produce sentinels in this situation.
    • The TCP connection to the action container "fails" or breaks down during POST /init or while waiting for a response AND the OOM check is positive. We assume that the managed action runtime did not have a chance to react to this error situation.
    • Example: action code consumes too much physical memory by including a lot of prerequisite packages.
    • Expectation: managed action runtimes won't produce sentinels in this situation.
    • The TCP connection to the action container "fails" or breaks down during POST /init or while waiting for a response AND the OOM check is negative. We assume that the managed action runtime did not have a chance to react to this error situation.
    • Example: the action code breaks the managed action runtime process by exiting.
    • Expectation: managed action runtimes won't produce sentinels in this situation. Malicious actions may be able to delay / prevent sentinel production.

In ActivationResult - processRunResponseContent():

    • POST /run returns a response with status != OK (200), the response is not truncated and is valid JSON. We assume that the managed action runtime detected a problem during / after run and sent the response.
    • Example: action runtime caught an exception thrown by action code.
    • Expectation: managed action runtimes should produce sentinels in this situation.
    • POST /run returns a response with status != OK (200), the response is not truncated and is valid JSON - but the JSON is not an object (instead: String, Number, ...). We assume that the managed action runtime detected a problem during / after run and sent a broken response due to a bug.
    • Should not happen for managed runtimes because such failures should be caught in tests. Blackbox containers may return improper responses. Can malicious actions cause sending of a broken response?
    • Example: no known examples.
    • Expectation: managed action runtimes should produce sentinels in this situation.
    • POST /run returns a response with status != OK (200), the response is not truncated but is not valid JSON. We assume that the managed action runtime detected a problem during / after run and sent a broken response due to a bug.
    • Should not happen for managed runtimes because such failures should be caught in tests. Blackbox containers may return improper responses. Can malicious actions cause sending of a broken response?
    • Example: no known examples.
    • Expectation: managed action runtimes should produce sentinels in this situation.
  • developerError(truncatedResponse(str, length, maxlength))

    • POST /run returns a response with status != OK (200) and the response is truncated because it exceeds the size limit (1 MiB). We assume that the managed action runtime detected a problem during / after run and sent a broken response due to a bug.
    • Should not happen for managed runtimes because such failures should be caught in tests. Blackbox containers may return improper responses. Can malicious actions cause sending of a broken response?
    • Example: no known examples.
    • Expectation: managed action runtimes should produce sentinels in this situation.
    • The TCP connection to the action container "fails" or breaks down during POST /run or while waiting for a response AND the OOM check is positive. We assume that the managed action runtime did not have a chance to react to this error situation.
    • Example: action code consumes too much physical memory by creating too many objects.
    • Expectation: managed action runtimes won't produce sentinels in this situation.
    • The TCP connection to the action container "fails" or breaks down during POST /run or while waiting for a response AND the OOM check is negative. We assume that the managed action runtime did not have a chance to react to this error situation.
    • Example: the action code breaks the managed action runtime process by exiting.
    • Expectation: managed action runtimes won't produce sentinels in this situation. Malicious actions may be able to delay / prevent sentinel production.
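
To summarize the expectations above, a hypothetical sketch (illustrative names only, not OpenWhisk code):

// Summary of the assessment: for each developer-error situation, whether a
// managed runtime is expected to have written sentinels before log collection.
object SentinelExpectation {
  sealed trait DeveloperErrorCase
  case object NonOkResponse             extends DeveloperErrorCase // runtime answered /init or /run with an error status
  case object BrokenOrTruncatedResponse extends DeveloperErrorCase // runtime bug or blackbox container
  case object ConnectionFailedOomKilled extends DeveloperErrorCase // container ran out of memory
  case object ConnectionFailedOther     extends DeveloperErrorCase // runtime process died, e.g. the action exited

  def sentinelsExpected(c: DeveloperErrorCase): Boolean = c match {
    case NonOkResponse | BrokenOrTruncatedResponse         => true  // the runtime had a chance to react
    case ConnectionFailedOomKilled | ConnectionFailedOther => false // the runtime could not react
  }
}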

Contributor
@markusthoemmes markusthoemmes left a comment

Good find and solid fix. Thanks! 🎉 A few suggestions throughout.

* There are two possible modes controlled by parameter waitForSentinel:
*
* Only parses and returns as much logs as fit in the passed log limit.
* 1. Wait for sentinel: tail container log file until two sentinel markers show up. Complete
Contributor

Please break the description to the next line. Indent by one or two spaces if you feel it's necessary there.

Member Author

Makes sense. Re-formatting as suggested.

logs.concat(
Source.single(ByteString(LogLine(Instant.now.toString, "stderr", Messages.logFailure).toJson.compactPrint)))
Source.single(
ByteString(LogLine(Instant.now.toString, "stderr", Messages.logWarningDeveloperError).toJson.compactPrint)))
Contributor

I'm assuming the log will end with both a logging error and this error, is that correct? If so: maybe we should drop the TimeoutException from the recover below and add that to the stream here. We can then decide if we want to append the general log failure or this new one.

Member Author

I'm assuming the log will end with both a logging error and this error, is that correct?

I think no, because:

  • If we have a developer error, we won't tail the log file but only read to its current end. In addition, we do not require finding 2 sentinel markers.
  • With these conditions, the likelihood to run into idle / completion timeouts should be low.
  • All other errors (framing, log limit exceeded) could possibly occur - but there is no correlation to the developer error.

My understanding is that you are thinking about breaking ties when we have two different error conditions like developer error AND timeout / framing error / log limit exceeded?

From my point of view, there are three perspectives on that:

  1. As action developer, I use the action logs to understand and debug problems with init'ing and running actions. For the action developer, it probably doesn't matter which error appears in the last log line or whether the last two log lines mention more specific error conditions (There was an issue ... and The action did not initialize or run as expected...) - as long as there is a hint about what happened and that the log is probably truncated.

  2. From a container proxy perspective, I want to remove (and not re-use) an action container that had a developer or log collecting and processing error. From that perspective, it doesn't matter which of these two error conditions arrives in the container proxy.

  3. As operator, I do not want to see indications (system log messages, metrics) for failed log collections if there is a developer error but log collecting and processing works properly. Only if a timeout / framing error / log limit exceeded occurs do I want a log collecting failure. That's why I excluded logWarningDeveloperError as a possible error message.

The current variant would not "see" any of the current error messages There was an issue while collecting your logs. Data might be missing. and Logs were truncated because the total bytes size exceeds the limit of ${limit.toBytes} bytes. in case of a developer error because the last message would always be The action did not initialize or run as expected. Log data might be missing.. From that perspective, it would be good to append the developer error message after checking the last log line for log collecting / processing errors.
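
To make the intended ordering concrete, a small sketch with plain collections (hypothetical helper, not the actual Akka Streams pipeline in DockerToActivationFileLogStore):

object LogFinalization {
  // Messages quoted above; the real constants live elsewhere in OpenWhisk.
  val logFailure = "There was an issue while collecting your logs. Data might be missing."
  val developerErrorHint = "The action did not initialize or run as expected. Log data might be missing."

  // Check the collected lines for a log collecting error first, then append the
  // developer-error hint, so the hint does not mask the log collecting check.
  def finalizeLogs(lines: Vector[String], isDeveloperError: Boolean): (Vector[String], Boolean) = {
    val logCollectingError = lines.lastOption.exists(_.contains(logFailure))
    val result = if (isDeveloperError) lines :+ developerErrorHint else lines
    (result, logCollectingError)
  }
}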

Contributor

Thanks for the exhaustive writeup. My only concern was having multiple error messages in the log file. The missing piece in my head was that we don't tail in the non-sentinel case (I'm hitting my past self for this, I should have introduced a separate path for that which doesn't have the idleTimeout for instance 😂 )

Member Author

@markusthoemmes I pushed a new commit that should properly deal with log collecting errors when there is a developer error. Please have a look. It's not "nice" code, though...

// Define a time limit for collecting and processing the logs of a single activation.
// If this time limit is exceeded, log processing is stopped and declared unsuccessful.
// Allow 2 seconds per MB log limit, with a minimum of 10 seconds.
val logProcessingTimeout = limit.toMB.toInt.max(5) * 2.seconds
Contributor

This is somewhat arbitrary and might need some tweaking. Can you run this code through a loadtest to determine we don't hit this overall timeout under load too often?

Member Author

Good point. I checked production histogram metrics for invoker.collect_logs_finish and the max value for around 20 million action container log reads is below 1.2 seconds. So I conclude that 20 seconds for a 10 MiB log limit should be a reasonable setting.
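
For reference, the formula above works out as follows (Scala REPL style, using the standard duration DSL):

import scala.concurrent.duration._

// limit.toMB.toInt.max(5) * 2.seconds from the snippet above:
val forOneMB   = 1.max(5) * 2.seconds   // 10 seconds - the minimum applies
val forTenMB   = 10.max(5) * 2.seconds  // 20 seconds - the value discussed here
val forFiftyMB = 50.max(5) * 2.seconds  // 100 seconds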

Contributor

Per @rabbah's comment below, I'm fine with going longer rather than too short on this for now. It's only a safeguard anyway.

Member

nice @sven-lange-last thanks for checking.

Member Author

@markusthoemmes @rabbah I explicitly want to check with you on the total timeout setting. Are you ok with the setting above (2s / MiB log limit, minimum 10s) - given that this is "just" a safeguard to address other currently unknown problems? As discussed, the developer error is mainly addressed by changing from sentinel-based tailing to non-sentinel reading to end-of-file without tailing.

Contributor

Yes I am fine with that.

Member

OK with me, though I would also have been OK with a lower hard cap. I think this is a dilation factor we can move to the configuration parameters.

@sven-lange-last sven-lange-last self-assigned this Aug 13, 2019
// between two loglines appear without relying on the log frequency in the action itself.
.idleTimeout(waitForLogs)
// Apply an overall time limit for this log collecting and processing stream.
.completionTimeout(logProcessingTimeout)
Member
@rabbah rabbah Aug 13, 2019

It looks to me like this is the crux of the patch - time-limit the log extraction until the sentinel is found. I don’t know what a reasonable rate should be, but up to 2s/1MB strikes me as long. I like Markus’ suggestion and wonder what the drain rate might be under load. That’s especially relevant when waiting is on the critical path.

Contributor

It's not really the crux but an additional safeguard as I read it. It's a "catchall".

The crux here is to ignore sentinels (and thus don't tail logs, which guarantees a finish of the stream) in error cases.
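
For context, a sketch of the two modes (assumes akka-stream and Alpakka's file connector on the classpath; path and parameters are made up, not the actual DockerContainer code):

import java.nio.file.Paths
import scala.concurrent.duration._
import akka.stream.alpakka.file.scaladsl.FileTailSource
import akka.stream.scaladsl.{FileIO, Source}
import akka.util.ByteString

val logFile = Paths.get("/var/log/actions/container.log") // made-up path

// Tailing mode: polls the file for new data and never completes on its own,
// so it must be terminated by finding sentinels (or by a timeout).
val tailing: Source[ByteString, _] =
  FileTailSource(logFile, maxChunkSize = 4096, startingPosition = 0L, pollingInterval = 5.milliseconds)

// Non-tailing mode: reads up to the current end of file and then completes -
// this is what the change switches to for developer errors.
val bounded: Source[ByteString, _] = FileIO.fromPath(logFile)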

Member

See my comment above on why I thought/think this is the essence of the patch. Arguably this limit should have already existed when waiting for logs to drain. But I understand the combined effect as well now, thanks to the excellent comments.

* Improve error handling such that developer errors are not log collecting errors. Properly detect log collecting errors independently of developer errors.
* Add unit test coverage for different error situations.
Member
@rabbah rabbah left a comment

LGTM. I defer to @markusthoemmes on readiness to merge.

val isTimedoutActivation = activation.isTimedoutActivation
val logs = logStream(transid, container, action, isTimedoutActivation)
val logLimit = action.limits.logs
val isDeveloperError = activation.response.isContainerError // container error means developer error
Member

so the change here is that any time the tcp connection is severed, the invoker will not wait for logs?

Member Author

Yes, that's true - it will only read until the end of the log file and not tail until a sentinel log marker is found. And also for more conditions - see here: #4582 (comment).

* in a managed action runtime and prevent the production of sentinels. In that case, log file
* reading may continue endlessly.
*
* For these reasons, do not wait for sentinels to appear in log output when activations end up
Member

OK this answers my question above - thanks.

If the log reading is always capped (don't wait more than X seconds), does this distinction really become necessary? I think it's a heuristic based on the intuition implied by the comment (container aborted, it's unlikely a sentinel will exist).

Member Author

My goal was to quickly complete log collecting if a developer error occurred. That's accomplished with not tailing the log file on developer error.

The total log collecting timeout was "just" meant as a safeguard for further, currently unknown flaws.

If we used a total log collecting timeout also for developer errors, we would give action containers some additional seconds of CPU time.


protected[core] def messageForCode(code: Int) = {
require(code >= 0 && code <= 3)
require(code >= Success && code <= WhiskError)
Member

we could probably turn these into case objects... in the future.

Member Author

I agree - that also came to my mind... But I was already happy to replace numbers with vals.
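
For what it's worth, a hypothetical sketch of the case-object idea (names and messages are illustrative, not necessarily the actual ActivationResponse constants):

object StatusCodes {
  sealed abstract class StatusCode(val code: Int, val message: String)
  case object Success          extends StatusCode(0, "success")
  case object ApplicationError extends StatusCode(1, "application error")
  case object DeveloperError   extends StatusCode(2, "action developer error")
  case object WhiskError       extends StatusCode(3, "whisk internal error")

  private val all = Seq(Success, ApplicationError, DeveloperError, WhiskError)

  // Invalid codes become None instead of tripping a require().
  def fromCode(code: Int): Option[StatusCode] = all.find(_.code == code)
}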

* @param isDeveloperError did activation fail due to developer error?
* @return true if log collecting failed, false otherwise
*/
protected def isLogCollectingError(actLogs: ActivationLogs,
Contributor

The logic here is kinda what I was fearing earlier 😂 . Should we at least leave a TODO to return errors from the log gathering itself and append the appropriate lines based on these after the fact?

Member Author

Well... I feel ashamed about this piece of code... at the same time, I had the feeling that I would have to improve other areas as well.

For example: the log reading pipeline recovers exceptions with regular stream elements, i.e. log messages that we check for later on. If an action just logged one of these messages as its last log line, we would detect this as a log collecting error...

I tried a lot of different approaches and one of the problems was that the DockerToActivationFileLogStore has a fan-out to different sinks so that the "injection" of error messages needs to be performed at a reasonable stage of the pipeline. That probably requires a custom flow or sink.

I guess it's a good idea to leave a TODO in the comments.
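
One possible shape for that TODO, sketched with made-up names (not the actual DockerToActivationFileLogStore pipeline):

import java.util.concurrent.TimeoutException
import akka.NotUsed
import akka.stream.scaladsl.Flow

object TaggedLogStream {
  final case class LogLine(time: String, stream: String, log: String) // redefined here for self-containment

  // Carry stream failures as data (Left) instead of recovering them into fake
  // log lines, so downstream stages can append error hints and raise metrics
  // without string-matching the last log line.
  def tagErrors: Flow[LogLine, Either[Throwable, LogLine], NotUsed] =
    Flow[LogLine]
      .map(line => Right(line): Either[Throwable, LogLine])
      .recover { case t: TimeoutException => Left(t) }
}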

@codecov-io

codecov-io commented Aug 14, 2019

Codecov Report

Merging #4582 into master will decrease coverage by 5.71%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4582      +/-   ##
==========================================
- Coverage   84.52%   78.81%   -5.72%     
==========================================
  Files         183      183              
  Lines        8258     8269      +11     
  Branches      552      566      +14     
==========================================
- Hits         6980     6517     -463     
- Misses       1278     1752     +474
Impacted Files Coverage Δ
...pache/openwhisk/core/entity/ActivationResult.scala 96.92% <100%> (ø) ⬆️
...ainerpool/logging/DockerToActivationLogStore.scala 100% <100%> (ø) ⬆️
...cala/org/apache/openwhisk/http/ErrorResponse.scala 92.3% <100%> (+0.08%) ⬆️
...rpool/logging/DockerToActivationFileLogStore.scala 68.08% <100%> (-0.67%) ⬇️
...sk/core/containerpool/docker/DockerContainer.scala 96.42% <100%> (+0.08%) ⬆️
...core/database/cosmosdb/RxObservableImplicits.scala 0% <0%> (-100%) ⬇️
...ore/database/cosmosdb/cache/CacheInvalidator.scala 0% <0%> (-100%) ⬇️
...core/database/cosmosdb/CosmosDBArtifactStore.scala 0% <0%> (-95.89%) ⬇️
...tabase/cosmosdb/cache/CacheInvalidatorConfig.scala 0% <0%> (-94.74%) ⬇️
...sk/core/database/cosmosdb/CosmosDBViewMapper.scala 0% <0%> (-92.6%) ⬇️
... and 13 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d9de97d...1e367b5.

@sven-lange-last
Member Author

@markusthoemmes @rabbah I think I addressed your review comments. Please check and let me know.

@rabbah concerning your comment / request:

I think this is a dilation factor we can move to the configuration parameters.

I didn't fully expose this factor as a configuration parameter yet because we were fine so far with also having other timeouts as vals local to the class. See

protected val waitForLogs: FiniteDuration = 2.seconds
protected val waitForOomState: FiniteDuration = 2.seconds
protected val filePollInterval: FiniteDuration = 5.milliseconds

  • That's why I renamed one of the timeouts for more clarity and added logCollectingTimeoutPerMBLogLimit to this list.
  • I suggest that we expose either all or none of these config parameters, and only if there is a need to change the defaults (one possible grouping is sketched below). @rabbah okay to delay loading these config parameters or do you need config loading already now?
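
For instance, one possible grouping (purely illustrative; the config class does not exist in OpenWhisk, and the defaults mirror the vals above):

import scala.concurrent.duration._

// Hypothetical grouping of the timeouts listed above; it could be loaded with
// the usual pureconfig mechanism if exposing them ever becomes necessary.
final case class LogCollectingConfig(waitForLogs: FiniteDuration = 2.seconds,
                                     waitForOomState: FiniteDuration = 2.seconds,
                                     filePollInterval: FiniteDuration = 5.milliseconds,
                                     logCollectingTimeoutPerMBLogLimit: FiniteDuration = 2.seconds)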

@rabbah
Member

rabbah commented Aug 15, 2019

@rabbah okay to delay loading these config parameters or do you need config loading already now?

Yes, as I wrote, looked good to me already ;)

@sven-lange-last
Member Author

@markusthoemmes please let me know whether anything is still required from your side.

@style95 please check whether you are ok.

I would like to get this PR merged if everybody is ok...

Member
@style95 style95 left a comment

@sven-lange-last Thank you for giving me time to look into this.
It looks good to me and would be fine with our production system as well.

Contributor
@markusthoemmes markusthoemmes left a comment

LGTM

@markusthoemmes markusthoemmes merged commit 92ac571 into apache:master Aug 16, 2019
BillZong pushed a commit to BillZong/openwhisk that referenced this pull request Nov 18, 2019
…er error (apache#4582)

* Use value definitions instead of literal values

* Action container log collection does not wait for sentinel on developer error

* So far, log collection for managed Docker action containers waits for sentinels even if a developer error occurs. Sentinel waiting is only disabled when an action exceeds its init or run timeout.
* This change disables sentinel waiting for all developer errors because a managed action runtime may not be able to write sentinels at all in developer error situations. With this change, missing sentinels are handled in a more robust way.
* Without sentinel waiting, the Docker action container log file is only read until file end is reached. No re-reads are performed if the sentinel has not been found yet. On busy systems, this may lead to a few missing log lines in case of developer errors.

* Add review feedback: reformat block comments

* Fix ActionLimitsTests failure

* Address review feedback

* Improve error handling such that developer errors are not log collecting errors. Properly detect log collecting errors independently of developer errors.
* Add unit test coverage for different error situations.

* Address review feedback: add comments and centralize log collecting config
@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.81%. Comparing base (d9de97d) to head (1e367b5).
Report is 529 commits behind head on master.

❗ There is a different number of reports uploaded between BASE (d9de97d) and HEAD (1e367b5).

HEAD has 3 fewer uploads than BASE.
Uploads: BASE (d9de97d) 6, HEAD (1e367b5) 3
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4582      +/-   ##
==========================================
- Coverage   84.52%   78.81%   -5.72%     
==========================================
  Files         183      183              
  Lines        8258     8269      +11     
  Branches      552      566      +14     
==========================================
- Hits         6980     6517     -463     
- Misses       1278     1752     +474     

