Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exponentially backs off retry attempts for sending usage data #117955

Conversation

TinaHeiligers
Copy link
Contributor

@TinaHeiligers TinaHeiligers commented Nov 8, 2021

Resolves #115221
In some cases, a call to fetch all the usage data for a specific report time period might take a long time and add additional strain to a server already under a high load.

This PR exponential backs off retry fetch and send attempts within a report time window and changes the once-per-minute checks to only check if the next report is due.

In this PR we're changing the behavior of when we update the report date, from after a successful send request to before the first send request is attempted. Optimistically updating the report date before the calls avoids subsequent retries every minute.

This has the risk of ever increasing payloads that might eventually also put a large strain on the browser if any usage collector was previously relying on an at-most 24 hour usage data fetch window.

Checklist

Delete any items that are not applicable to this PR.

Risk Matrix

Risk Probability Severity Mitigation/Notes
Browser becomes unresponsive;unexpected behavior for large usage data payloads. Low High usage collectors need to implement a time-period restriction on data collection

For maintainers

@TinaHeiligers TinaHeiligers added backport:skip This commit does not require backporting Feature:Telemetry release_note:skip Skip the PR/issue when compiling release notes v8.1.0 labels Nov 8, 2021
Copy link
Contributor Author

@TinaHeiligers TinaHeiligers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self review

@@ -53,13 +53,22 @@ export class TelemetrySender {
return false;
};

private delayRetrySend = () => {
return 60 * (1000 * Math.min(Math.pow(2, this.retryCount), 64)); // 120s, 240s, 480s, 960s, 1920s, 3840s, 3840s, 3840s
Copy link
Contributor Author

@TinaHeiligers TinaHeiligers Nov 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be written as 60000 * Math.min(.....) but writing it this way makes it easier to read at a glance.

We limit the backoff period to just over once per hour.

private sendIfDue = async (): Promise<void> => {
if (this.isSending || !this.shouldSendReport()) {
if (!this.shouldSendReport()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exit early if and only if the next report isn't due to be sent. This backs-off on the once-per minute fetch and send request calls we used to make.

return;
}
// optimistically update the report date and reset the retry counter for a new time report interval window
this.lastReported = `${Date.now()}`;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a behavior change here where we're now updating the report date before sending the data vs only after a successful send. We need to do this to ensure we're backing off the retry attempts.

src/plugins/telemetry/public/services/telemetry_sender.ts Outdated Show resolved Hide resolved
@TinaHeiligers TinaHeiligers marked this pull request as ready for review November 9, 2021 02:31
@TinaHeiligers TinaHeiligers requested a review from a team as a code owner November 9, 2021 02:31
private sendIfDue = async (): Promise<void> => {
if (this.isSending || !this.shouldSendReport()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the removal of isSending, could we add unit tests where sendIfDue is called multiple time in a row without delay just to be sure that shouldSendReport properly handles it (aka sendUsageData should only be called once)

Copy link
Contributor Author

@TinaHeiligers TinaHeiligers Nov 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sendIfDue is private and we’re the only ones calling the method. It will never be called without a delay.

Comment on lines 97 to 98
window.setTimeout(this.sendUsageData, this.delayRetrySend());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: maybe we want to add else { console.log(something) } to at least surface the error in the console logs? I feel like it's better than just trapping the error.

Copy link
Contributor Author

@TinaHeiligers TinaHeiligers Nov 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +143 to +147
const telemetryService = mockTelemetryService();
const telemetrySender = new TelemetrySender(telemetryService);
telemetrySender['shouldSendReport'] = jest.fn().mockReturnValue(false);
telemetrySender['retryCount'] = 0;
await telemetrySender['sendIfDue']();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Thinking out loud, not related to the PR as it was already present) This is a code smell of bad isolation. Proper unit testing should never require to mock the blackboxed component.

e.g shouldSendReport mocking could be replaced with telemetryService.canSendTelemetry mocking (which would be the correct way to test the method's behavior btw). Here, we need to mock some of the TelemetrySender's methods to test the behavior of other methods of the class, which is not unit testing the class.

this.retryCount = this.retryCount + 1;
if (this.retryCount < 20) {
// exponentially backoff the time between subsequent retries to up to 19 attempts, after which we give up until the next report is due
window.setTimeout(this.sendUsageData, this.delayRetrySend());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT/optional: delayRetrySend can be extracted as a static function getRetryDelay(retryCount: number)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent suggestion, thank you! It also made the testing much easier.

@TinaHeiligers TinaHeiligers enabled auto-merge (squash) November 10, 2021 01:36
@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
telemetry 23.9KB 24.2KB +327.0B

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@TinaHeiligers TinaHeiligers merged commit ad1e2ad into elastic:main Nov 10, 2021
kpatticha pushed a commit to kpatticha/kibana that referenced this pull request Nov 10, 2021
@TinaHeiligers TinaHeiligers added v7.16.0 v8.0.0 auto-backport Deprecated: Automatically backport this PR after it's merged and removed backport:skip This commit does not require backporting labels Nov 17, 2021
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Nov 17, 2021
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Nov 17, 2021
@kibanamachine
Copy link
Contributor

💚 Backport successful

Status Branch Result
8.0
7.16

The backport PRs will be merged automatically after passing CI.

kibanamachine added a commit that referenced this pull request Nov 17, 2021
… (#118900)

Co-authored-by: Christiane (Tina) Heiligers <christiane.heiligers@elastic.co>
kibanamachine added a commit that referenced this pull request Nov 17, 2021
… (#118901)

Co-authored-by: Christiane (Tina) Heiligers <christiane.heiligers@elastic.co>
@rudolf rudolf added the bug Fixes for quality problems that affect the customer experience label Nov 18, 2021
roeehub pushed a commit to build-security/kibana that referenced this pull request Dec 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Deprecated: Automatically backport this PR after it's merged bug Fixes for quality problems that affect the customer experience Feature:Telemetry release_note:skip Skip the PR/issue when compiling release notes v7.16.0 v8.0.0 v8.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Browser retries failed telemetry requests every 60s
4 participants