Make intermittently failing concurrencyTests more stable. by cbickel · Pull Request #4281 · apache/openwhisk

cbickel · 2019-02-13T15:47:31Z

The test Action concurrency limits should execute activations sequentially when concurrency = 1 fails intermittently. The reason is, that 4 activations are started in parallel.
If there is a delay to one of these activations, the last activation will reuse an existing container, which has already an incremented counter.
This PR changes the action to decrement the counter before leaving the action.

This will also enhance local debugging, as the warm container can be reused.

Related issue and scope

I opened an issue to propose and discuss this change (#????)

My changes affect the following components

Types of changes

Bug fix (generally a non-breaking change which closes an issue).
Enhancement or new feature (adds new functionality).
Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

I signed an Apache CLA.
I reviewed the style guides and followed the recommendations (Travis CI will check :).
I added tests to cover my changes.
My changes require further changes to the documentation.
I updated the documentation where necessary.

codecov-io · 2019-02-13T16:53:20Z

Codecov Report

Merging #4281 into master will increase coverage by 21.16%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##           master    #4281       +/-   ##
===========================================
+ Coverage   59.81%   80.97%   +21.16%     
===========================================
  Files         163      163               
  Lines        7594     7594               
  Branches      502      502               
===========================================
+ Hits         4542     6149     +1607     
+ Misses       3052     1445     -1607

Impacted Files	Coverage Δ
.../org/apache/openwhisk/core/entity/EntityPath.scala	`100% <0%> (+1.88%)`	⬆️
...a/org/apache/openwhisk/core/entity/Parameter.scala	`95.45% <0%> (+2.27%)`	⬆️
.../org/apache/openwhisk/http/PoolingRestClient.scala	`90% <0%> (+3.33%)`	⬆️
...hisk/core/controller/actions/SequenceActions.scala	`92.1% <0%> (+3.5%)`	⬆️
...pache/openwhisk/core/containerpool/Container.scala	`88.6% <0%> (+3.79%)`	⬆️
...abase/cosmosdb/CosmosDBArtifactStoreProvider.scala	`4% <0%> (+4%)`	⬆️
...tainerpool/docker/DockerClientWithFileAccess.scala	`96% <0%> (+4%)`	⬆️
...rg/apache/openwhisk/core/entity/ExecManifest.scala	`97.87% <0%> (+4.25%)`	⬆️
...he/openwhisk/core/database/CouchDbRestClient.scala	`87.23% <0%> (+4.25%)`	⬆️
...a/org/apache/openwhisk/core/entity/TimeLimit.scala	`95.45% <0%> (+4.54%)`	⬆️
... and 92 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 619c86d...afaefcb. Read the comment docs.

tysonnorris · 2019-02-13T18:43:43Z

I guess this is OK, but I'm not sure what the test failure looks like?

The reason is, that 4 activations are started in parallel.
If there is a delay to one of these activations, the last activation will reuse an existing container, which has already an incremented counter.

But the test is for explicitly reusing the container - are you saying a different test is unexpectedly using the same container? If so, there is still a risk I think in case that other test uses the container before the interval is up that resets the counter. I don't see other tests sharing the same action that uses concurrent.js

Or is this a case where of the 4 activations - one (or more) take > 30s to arrive, which results in rejection? If one arrives after another is rejected, I expect the container to be destroyed and not reused.

rabbah · 2019-02-13T18:48:28Z

tests/dat/actions/concurrent.js

 // Licensed to the Apache Software Foundation (ASF) under one or more contributor
 // license agreements; and to You under the Apache License, Version 2.0.

+const Promise = require('bluebird');


what is this for?

I wanted to use finally to decrement the counter again. But finally is only part of node10. The alternative on handling the counter would be to duplicate the code to decrement it (once in error case and once in success case).

That means I only had the choice between using Promises of bluebird in default node version or to use native Promises with node10. And I decided to use bluebird as it seems to be already part of our action container and I didn't want to hardcode the version for these tests.

rabbah · 2019-02-13T18:49:39Z

tests/dat/actions/concurrent.js

+        }).finally(() => {
+            // Before leaving the container, decrement the counter again. But wait twice the interval, before decrementing it. Otherwise, the
+            // other requests might not realise, that all requests were inside the container.
+            setTimeout(() => counter--, 2 * interval);


when is finally executed - ie is there a guarantee the container isn't paused first?

Finally is executed after the promise, regardless of the promises fade. It is the equivalent of scalas Future.andThen { ... }.

I think @rabbah's comment is towards: Is it guaranteed to be executed before the next .then block? (Scala's andThen forces that ordering).

Is there any reason not to:

remove the finally (and the setTimeout within)

add counter=0; to immediately before resolve(result) (within checkRequests)

This way there is no timing issue (potentially racing pause grace timeout), and the counter always gets reset unless a reject occurs.

@markusthoemmes @rabbah
If I understand the following documentation of http://bluebirdjs.com/docs/api/finally.html correctly, I would say, that the finally will always be executed after the next then:

If the handler function passed to .finally returns a promise, the promise returned by .finally will not be settled until the promise returned by the handler is settled. If the handler fulfills its promise, the returned promise will be fulfilled or rejected with the original value. If the handler rejects its promise, the returned promise will be rejected with the handler's value.

As finally returns a new promise, which is required for the next .then-block, I should be executed first.

@tysonnorris
I don't know, if I get your proposal completely, but this action is also used for other tests, which expect, that there are concurrent requests within the same container.
By removing the setTimeout before decrementing, only one of these requests has the chance to realise, that all expected requests are inside the container now. By setting the counter to 0, this would not be the correct behaviour for tests, that expect several requests inside this container. And adapting the counter only in success case would also be the wrong behaviour.

@cbickel How about:

add a new variable to track pending requests count

decrement that before resolve

reset counter when pending reaches 0

let pending = 0; ... //when request begins: counter++; pending++; ... //in checkRequests: pending--; if (pending == 0){ counter=0; //counter will be reset exactly once, when pending reaches 0 }

Just trying to make it even less sensitive to timing configs.

And adapting the counter only in success case would also be the wrong behaviour.

Is it? In case of rejections, won't the container not be used as warm anymore (should be destroyed instead of reused)?

I see the following problems with this implementation:

the counter doesn't represent the amount of requests in the container anymore.

as the action is only failed with an application-error, the container will not be removed and will potentially be reused. So we need to duplicate the code of decrementing the counter.

Not sure I follow, this seems to work testing locally.

the counter doesn't represent the amount of requests in the container anymore.

How so? we only reset counter at time of resolve, and resolve only if the counter is expected value (n in case of n concurrent supported activations)

as the action is only failed with an application-error, the container will not be removed

I see - I get the error behavior wrong all the time :(... so I guess regardless of any reject happening, the test where reject occurs will fail, and within that test, at least the rejected activation will fail, but additionally another may fail (due to container reuse), which seems ok within the same action+test? (the tests all use uniquely named actions, so there should not be reuse between tests).
WDYT?

cbickel · 2019-02-14T07:06:21Z

@tysonnorris
The test execute activations sequentially when concurrency = 1 creates the concurrent.js action with a concurrency of 1.
Afterwards, 4 activations are started in parallel.
As the concurrency is set to one, there should be 4 user-containers. In each container, the counter will be set to one.
But there could be the issue, that one of the 4 actions arrives with a delay at the invoker. If the delay is high enough, that one of the other 3 actions is already finished, the warm container will be reused, instead of a new one. As the counter is not set to 0 in a warm container, the condition counter == requestCount will never be true (conter is already set to 2, but requestCount is 1).
If an action decrements the counter on leaving again, the counter will always return the amount of actions that are currently inside this container.

tysonnorris · 2019-02-15T04:29:15Z

@tysonnorris
The test execute activations sequentially when concurrency = 1 creates the concurrent.js action with a concurrency of 1.

ahhhh. sorry I misunderstood. I see the issue now, just have a question about the impl.

The test `Action concurrency limits should execute activations sequentially when concurrency = 1` fails intermittently. The reason is, that 4 activations are started in parallel. If there is a delay to one of these activations, the last activation will reuse an existing container, which has already an incremented counter. This PR changes the action to decrement the counter before leaving the action. This will also enhance local debugging, as the warm container can be reused.

style95 · 2023-05-20T21:55:14Z

@cbickel
I marked this PR as stale to close soon due to long inactivity.
Please feel free to let me know if you would work on it again.

cbickel force-pushed the ct branch from 398ae33 to b3273f2 Compare February 13, 2019 15:47

cbickel requested a review from tysonnorris February 13, 2019 15:47

cbickel assigned tysonnorris Feb 13, 2019

rabbah reviewed Feb 13, 2019

View reviewed changes

rabbah added testing review Review for this PR has been requested and yet needs to be done. heisenbug labels Feb 13, 2019

cbickel force-pushed the ct branch from b3273f2 to afaefcb Compare February 18, 2019 07:40

style95 added the stale old issue which needs to validate label May 20, 2023

style95 closed this May 31, 2023

Conversation

cbickel commented Feb 13, 2019

Related issue and scope

My changes affect the following components

Types of changes

Checklist:

Uh oh!

codecov-io commented Feb 13, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

tysonnorris commented Feb 13, 2019

Uh oh!

rabbah Feb 13, 2019

Choose a reason for hiding this comment

Uh oh!

cbickel Feb 14, 2019

Choose a reason for hiding this comment

Uh oh!

rabbah Feb 13, 2019

Choose a reason for hiding this comment

Uh oh!

cbickel Feb 14, 2019

Choose a reason for hiding this comment

Uh oh!

markusthoemmes Feb 14, 2019

Choose a reason for hiding this comment

Uh oh!

tysonnorris Feb 15, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cbickel Feb 18, 2019

Choose a reason for hiding this comment

Uh oh!

tysonnorris Feb 18, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cbickel Feb 19, 2019

Choose a reason for hiding this comment

Uh oh!

tysonnorris Feb 22, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cbickel commented Feb 14, 2019

Uh oh!

tysonnorris commented Feb 15, 2019

Uh oh!

style95 commented May 20, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Comments

codecov-io commented Feb 13, 2019 •

edited

Loading

tysonnorris Feb 15, 2019 •

edited

Loading

tysonnorris Feb 18, 2019 •

edited

Loading

tysonnorris Feb 22, 2019 •

edited

Loading