Skip to content

Make intermittently failing concurrencyTests more stable.#4281

Closed
cbickel wants to merge 1 commit intoapache:masterfrom
cbickel:ct
Closed

Make intermittently failing concurrencyTests more stable.#4281
cbickel wants to merge 1 commit intoapache:masterfrom
cbickel:ct

Conversation

@cbickel
Copy link
Contributor

@cbickel cbickel commented Feb 13, 2019

The test Action concurrency limits should execute activations sequentially when concurrency = 1 fails intermittently. The reason is, that 4 activations are started in parallel.
If there is a delay to one of these activations, the last activation will reuse an existing container, which has already an incremented counter.
This PR changes the action to decrement the counter before leaving the action.

This will also enhance local debugging, as the warm container can be reused.

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@codecov-io
Copy link

codecov-io commented Feb 13, 2019

Codecov Report

Merging #4281 into master will increase coverage by 21.16%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #4281       +/-   ##
===========================================
+ Coverage   59.81%   80.97%   +21.16%     
===========================================
  Files         163      163               
  Lines        7594     7594               
  Branches      502      502               
===========================================
+ Hits         4542     6149     +1607     
+ Misses       3052     1445     -1607
Impacted Files Coverage Δ
.../org/apache/openwhisk/core/entity/EntityPath.scala 100% <0%> (+1.88%) ⬆️
...a/org/apache/openwhisk/core/entity/Parameter.scala 95.45% <0%> (+2.27%) ⬆️
.../org/apache/openwhisk/http/PoolingRestClient.scala 90% <0%> (+3.33%) ⬆️
...hisk/core/controller/actions/SequenceActions.scala 92.1% <0%> (+3.5%) ⬆️
...pache/openwhisk/core/containerpool/Container.scala 88.6% <0%> (+3.79%) ⬆️
...abase/cosmosdb/CosmosDBArtifactStoreProvider.scala 4% <0%> (+4%) ⬆️
...tainerpool/docker/DockerClientWithFileAccess.scala 96% <0%> (+4%) ⬆️
...rg/apache/openwhisk/core/entity/ExecManifest.scala 97.87% <0%> (+4.25%) ⬆️
...he/openwhisk/core/database/CouchDbRestClient.scala 87.23% <0%> (+4.25%) ⬆️
...a/org/apache/openwhisk/core/entity/TimeLimit.scala 95.45% <0%> (+4.54%) ⬆️
... and 92 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 619c86d...afaefcb. Read the comment docs.

@tysonnorris
Copy link
Contributor

I guess this is OK, but I'm not sure what the test failure looks like?

The reason is, that 4 activations are started in parallel.
If there is a delay to one of these activations, the last activation will reuse an existing container, which has already an incremented counter.

But the test is for explicitly reusing the container - are you saying a different test is unexpectedly using the same container? If so, there is still a risk I think in case that other test uses the container before the interval is up that resets the counter. I don't see other tests sharing the same action that uses concurrent.js

Or is this a case where of the 4 activations - one (or more) take > 30s to arrive, which results in rejection? If one arrives after another is rejected, I expect the container to be destroyed and not reused.

// Licensed to the Apache Software Foundation (ASF) under one or more contributor
// license agreements; and to You under the Apache License, Version 2.0.

const Promise = require('bluebird');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to use finally to decrement the counter again. But finally is only part of node10. The alternative on handling the counter would be to duplicate the code to decrement it (once in error case and once in success case).

That means I only had the choice between using Promises of bluebird in default node version or to use native Promises with node10. And I decided to use bluebird as it seems to be already part of our action container and I didn't want to hardcode the version for these tests.

}).finally(() => {
// Before leaving the container, decrement the counter again. But wait twice the interval, before decrementing it. Otherwise, the
// other requests might not realise, that all requests were inside the container.
setTimeout(() => counter--, 2 * interval);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when is finally executed - ie is there a guarantee the container isn't paused first?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally is executed after the promise, regardless of the promises fade. It is the equivalent of scalas Future.andThen { ... }.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @rabbah's comment is towards: Is it guaranteed to be executed before the next .then block? (Scala's andThen forces that ordering).

Copy link
Contributor

@tysonnorris tysonnorris Feb 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason not to:

  • remove the finally (and the setTimeout within)
  • add counter=0; to immediately before resolve(result) (within checkRequests)

This way there is no timing issue (potentially racing pause grace timeout), and the counter always gets reset unless a reject occurs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markusthoemmes @rabbah
If I understand the following documentation of http://bluebirdjs.com/docs/api/finally.html correctly, I would say, that the finally will always be executed after the next then:

If the handler function passed to .finally returns a promise, the promise returned by .finally will not be settled until the promise returned by the handler is settled. If the handler fulfills its promise, the returned promise will be fulfilled or rejected with the original value. If the handler rejects its promise, the returned promise will be rejected with the handler's value.

As finally returns a new promise, which is required for the next .then-block, I should be executed first.

@tysonnorris
I don't know, if I get your proposal completely, but this action is also used for other tests, which expect, that there are concurrent requests within the same container.
By removing the setTimeout before decrementing, only one of these requests has the chance to realise, that all expected requests are inside the container now. By setting the counter to 0, this would not be the correct behaviour for tests, that expect several requests inside this container. And adapting the counter only in success case would also be the wrong behaviour.

Copy link
Contributor

@tysonnorris tysonnorris Feb 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cbickel How about:

  • add a new variable to track pending requests count
  • decrement that before resolve
  • reset counter when pending reaches 0
let pending = 0;
...
//when request begins:
        counter++;
        pending++;
...
//in checkRequests:
        pending--;
        if (pending == 0){
            counter=0; //counter will be reset exactly once, when pending reaches 0
        } 

Just trying to make it even less sensitive to timing configs.

And adapting the counter only in success case would also be the wrong behaviour.

Is it? In case of rejections, won't the container not be used as warm anymore (should be destroyed instead of reused)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the following problems with this implementation:

  • the counter doesn't represent the amount of requests in the container anymore.
  • as the action is only failed with an application-error, the container will not be removed and will potentially be reused. So we need to duplicate the code of decrementing the counter.

Copy link
Contributor

@tysonnorris tysonnorris Feb 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I follow, this seems to work testing locally.

the counter doesn't represent the amount of requests in the container anymore.

How so? we only reset counter at time of resolve, and resolve only if the counter is expected value (n in case of n concurrent supported activations)

as the action is only failed with an application-error, the container will not be removed

I see - I get the error behavior wrong all the time :(... so I guess regardless of any reject happening, the test where reject occurs will fail, and within that test, at least the rejected activation will fail, but additionally another may fail (due to container reuse), which seems ok within the same action+test? (the tests all use uniquely named actions, so there should not be reuse between tests).
WDYT?

@rabbah rabbah added testing review Review for this PR has been requested and yet needs to be done. heisenbug labels Feb 13, 2019
@cbickel
Copy link
Contributor Author

cbickel commented Feb 14, 2019

@tysonnorris
The test execute activations sequentially when concurrency = 1 creates the concurrent.js action with a concurrency of 1.
Afterwards, 4 activations are started in parallel.
As the concurrency is set to one, there should be 4 user-containers. In each container, the counter will be set to one.
But there could be the issue, that one of the 4 actions arrives with a delay at the invoker. If the delay is high enough, that one of the other 3 actions is already finished, the warm container will be reused, instead of a new one. As the counter is not set to 0 in a warm container, the condition counter == requestCount will never be true (conter is already set to 2, but requestCount is 1).
If an action decrements the counter on leaving again, the counter will always return the amount of actions that are currently inside this container.

@tysonnorris
Copy link
Contributor

@tysonnorris
The test execute activations sequentially when concurrency = 1 creates the concurrent.js action with a concurrency of 1.

ahhhh. sorry I misunderstood. I see the issue now, just have a question about the impl.

The test `Action concurrency limits should execute activations sequentially when concurrency = 1` fails intermittently. The reason is, that 4 activations are started in parallel.
If there is a delay to one of these activations, the last activation will reuse an existing container, which has already an incremented counter.
This PR changes the action to decrement the counter before leaving the action.

This will also enhance local debugging, as the warm container can be reused.
@style95 style95 added the stale old issue which needs to validate label May 20, 2023
@style95
Copy link
Member

style95 commented May 20, 2023

@cbickel
I marked this PR as stale to close soon due to long inactivity.
Please feel free to let me know if you would work on it again.

@style95 style95 closed this May 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

heisenbug review Review for this PR has been requested and yet needs to be done. stale old issue which needs to validate testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants

Comments