
[FOSS] hot fix and improvement #2207

Merged (25 commits) on Dec 3, 2021

Conversation

mingkun2020 (Contributor)

  1. Companion stack for MSK, MQ tests
  2. Throttling handle mechanism
  3. Option for adding pipeline name as testing stack prefix
  4. Update template runtime version

@mingkun2020 mingkun2020 changed the title Foss hot fix [FOSS] hot fix and improvement Nov 4, 2021
Review threads (outdated, resolved):
  integration/combination/test_function_with_alias.py
  integration/helpers/stack.py
  integration/combination/test_function_with_sns.py
  integration/combination/test_api_with_authorizers.py
  integration/single/test_basic_api.py
try:
    return func(*args, **kwargs)
except exc:
    sleep_time = random.uniform(0, math.pow(2, retry_attempt) * delay)
Contributor:
What's the reason for adding jitter?
Shouldn't the sleep time be calculated as delay +/- jitter instead of a random value between 0 and the max delay?
With the current approach, the 10th retry can sleep anywhere from 0 to 512 seconds. That seems like a very large range, and I'm not sure what it accomplishes.

Contributor Author:

The CloudFormation describe_stacks API that we use to check stack status has a hard limit (maybe 10 calls per second) that cannot be changed. Since we run the tests in parallel (for example, 10 concurrent threads), we end up making 10 calls at the same time (imagine a big cluster of calls in a small time frame). So we need to add jitter (some randomness) to spread the calls over a larger time frame and avoid throttling. If we used plain exponential backoff, the call cluster would simply reappear after each exponential delay and the problem would remain, which is why we need the jitter.

delay +/- jitter is similar to "equal jitter", while the approach I used here (a random value between 0 and the exponential delay) is "full jitter". They don't differ much in performance. More detail is in this blog post: https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
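For reference, a minimal sketch of the two jitter strategies discussed here (function names, base delay, and cap are illustrative, not taken from the PR):

```python
import random

BASE_DELAY = 1.0   # seconds; illustrative base delay
MAX_DELAY = 512.0  # illustrative cap on the exponential term


def full_jitter(retry_attempt: int) -> float:
    """Sleep anywhere in [0, exponential delay] -- the approach in this PR."""
    exp_delay = min(MAX_DELAY, BASE_DELAY * (2 ** retry_attempt))
    return random.uniform(0, exp_delay)


def equal_jitter(retry_attempt: int) -> float:
    """Sleep half the exponential delay plus a random half (delay +/- jitter)."""
    exp_delay = min(MAX_DELAY, BASE_DELAY * (2 ** retry_attempt))
    return exp_delay / 2 + random.uniform(0, exp_delay / 2)
```

For retry_attempt=10, full_jitter ranges over [0, 512] while equal_jitter ranges over [256, 512], which is exactly the trade-off being debated in this thread.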

Contributor:

Why is the final delay a random value between 0 and the max delay instead of plus or minus a percentage of the max delay?
In the 10th-retry example, it could end up with only a 1-second delay, which defeats the purpose of exponential backoff.

Contributor (CoshUS), Nov 23, 2021:

Green is the minimum, red the maximum, and purple the average.
This is the current implementation:
[chart: delay range per retry under the current full-jitter implementation]
The minimum seems off to me, and the range gets really large as the retry count increases.

With a percentage-based jitter:
[chart: delay range per retry under a percentage-based jitter]

Contributor Author:

Our goal is to spread the API calls across the test time frame as evenly as possible (if we use delay + jitter, we might have some time blocks with no calls at all). I'm thinking now about how to simulate this.
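A rough sketch of such a simulation (illustrative only, not the simulator mentioned later in the thread): generate retry times for a number of concurrent workers under each jitter strategy and count how many calls land in each one-second bucket, to see how evenly the calls spread.

```python
import random
from collections import Counter


def full_jitter(rng, attempt, base=1.0):
    """Random delay in [0, exponential delay] (current implementation)."""
    return rng.uniform(0, base * (2 ** attempt))


def equal_jitter(rng, attempt, base=1.0):
    """Half the exponential delay plus a random half (delay +/- jitter)."""
    d = base * (2 ** attempt)
    return d / 2 + rng.uniform(0, d / 2)


def simulate(jitter_fn, workers=100, retries=5, seed=42):
    """Bucket simulated call times (in whole seconds) per worker retry."""
    rng = random.Random(seed)
    buckets = Counter()
    for _ in range(workers):
        t = 0.0
        for attempt in range(retries):
            t += jitter_fn(rng, attempt)
            buckets[int(t)] += 1  # one API call at time t
    return buckets


if __name__ == "__main__":
    for name, fn in [("full", full_jitter), ("equal", equal_jitter)]:
        buckets = simulate(fn)
        print(name, "jitter, max calls in any 1s bucket:", max(buckets.values()))
```

Comparing the peak bucket count between strategies gives a crude measure of how well each one avoids the call clusters that trigger throttling.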

Contributor:

Tracking:

Current implementation is following the blog post: https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/

One assumption the blog post makes is that calls keep happening until the job is complete. In our case, however, we fail the test after x retries. Although the total amount of "work" is less, a job could roll low delays several times in a row and still fail the execution.
I think there is a better solution for our use case than either full jitter or equal jitter.

@mingkun2020 will work on a simulator for our use case specifically to test out different approaches.

This comment is not blocking.

CoshUS (Contributor) left a review:

LGTM! Thanks for addressing the comments!

codecov-commenter commented Nov 25, 2021

Codecov Report

Merging #2207 (aeeb334) into develop (0bc383f) will decrease coverage by 0.10%.
The diff coverage is n/a.


@@             Coverage Diff             @@
##           develop    #2207      +/-   ##
===========================================
- Coverage    94.43%   94.33%   -0.11%     
===========================================
  Files           95       95              
  Lines         6558     6598      +40     
  Branches      1325     1331       +6     
===========================================
+ Hits          6193     6224      +31     
- Misses         169      179      +10     
+ Partials       196      195       -1     
Impacted Files Coverage Δ
samtranslator/model/api/api_generator.py 93.24% <0.00%> (-1.42%) ⬇️
samtranslator/swagger/swagger.py 93.27% <0.00%> (-0.73%) ⬇️
samtranslator/model/api/http_api_generator.py 91.21% <0.00%> (-0.63%) ⬇️
samtranslator/model/lambda_.py 93.10% <0.00%> (ø)
samtranslator/model/apigateway.py 97.69% <0.00%> (ø)
samtranslator/model/sam_resources.py 92.35% <0.00%> (ø)
...translator/plugins/api/implicit_rest_api_plugin.py 100.00% <0.00%> (ø)
samtranslator/model/eventsources/push.py 92.33% <0.00%> (+0.02%) ⬆️
...translator/plugins/api/implicit_http_api_plugin.py 100.00% <0.00%> (+2.40%) ⬆️
samtranslator/model/eventsources/pull.py 92.89% <0.00%> (+2.84%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0bc383f...aeeb334.

@@ -1,5 +1,7 @@
import hashlib

import pytest
Contributor:

Is this import used?

Contributor Author:

Unused import, removed it.

deployments = self.client_provider.code_deploy_client.list_deployments()["deployments"]
deployments = self.client_provider.code_deploy_client.list_deployments(
    applicationName=application_name, deploymentGroupName=deployment_group
)["deployments"]
Contributor:

For client responses like this, are we catching any possible client errors? Also, if the client returned a None response due to some error, ["deployments"] would raise as well. Maybe .get("deployments") would be a better option.

Contributor Author:

If it returns None, the lookup will break and the test will fail, as expected (it shouldn't be None).
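As a side note, the defensive pattern the reviewer suggests could be sketched like this (a hypothetical helper, not code from this PR), turning a confusing KeyError/TypeError into a clear test failure:

```python
def extract_deployments(response):
    """Return the "deployments" list from a list_deployments-style response,
    raising a clear assertion error if the response shape is unexpected."""
    # (response or {}) guards against a None response; .get() guards
    # against a missing "deployments" key.
    deployments = (response or {}).get("deployments")
    if deployments is None:
        raise AssertionError(f"Unexpected list_deployments response: {response!r}")
    return deployments
```

Either way the test fails when the response is malformed; the difference is only in how readable the failure message is.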

@mingkun2020 mingkun2020 dismissed mndeveci’s stale review December 3, 2021 00:22

Reviewer is busy with other tasks and this PR already has 2 approvals.

@mingkun2020 mingkun2020 merged commit d2cf4b7 into aws:develop Dec 3, 2021
@mingkun2020 mingkun2020 deleted the foss-hot-fix branch December 3, 2021 00:23