Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Telemetry] Track event loop delays on the server #101580

Merged
merged 22 commits into from
Jun 22, 2021

Conversation

Bamieh
Copy link
Member

@Bamieh Bamieh commented Jun 8, 2021

Summary

Reports an array of daily histograms of delays in the event loop.

Uses perf_hooks.monitorEventLoopDelay to track the histogram with the default resolution of 10ms.

Event loop delays data is aggregated daily in a histogram per process ID.

The histogram is stored and updated in a saved object every 1 hour. This means that we'd maximum lose an hour worth of tracking on server shutdown.

Example payload

"event_loop_delays": {
  "daily": [
    {
      "min": 9093120,
      "max": 53247999,
      "mean": 11993238.600747818,
      "exceeds": 0,
      "stddev": 1168191.9357543814,
      "percentiles": {
        "50": 12607487,
        "75": 12615679,
        "95": 12648447,
        "99": 12713983
      },
      "processId": 55097
    },
    {
      "min": 9223372036854776000,
      "max": 0,
      "mean": null,
      "exceeds": 0,
      "stddev": null,
      "percentiles": {
        "50": 0,
        "75": 0,
        "95": 0,
        "99": 0
      },
      "processId": 54099
    },
  ]
}

Schema

export interface EventLoopDelaysUsageReport {
  daily: Array<{
    processId: number;
    min: number;
    max: number;
    mean: number;
    exceeds: number;
    stddev: number;
    percentiles: {
      '50': number;
      '75': number;
      '95': number;
      '99': number;
    };
  }>;
}

Notes

Once we have node v15.9 we can utilize the native Histogram provided by node to bake histograms into usage_counters to provide a histogram api in addition to the totals counter.

Context: #101283

@Bamieh Bamieh requested a review from mshustov June 8, 2021 12:05
@Bamieh Bamieh added Feature:Telemetry release_note:skip Skip the PR/issue when compiling release notes v7.14.0 v8.0.0 labels Jun 8, 2021
@Bamieh Bamieh changed the title initial implementation [Telemetry] Track event loop delays on the server Jun 8, 2021
@Bamieh Bamieh marked this pull request as ready for review June 9, 2021 12:30
@Bamieh Bamieh requested review from a team as code owners June 9, 2021 12:30
export function startTrackingEventLoopDelaysUsage(
internalRepository: ISavedObjectsRepository,
stopMonitoringEventLoop$: Subject<void>,
collectionStartDelay = MONITOR_EVENT_LOOP_DELAYS_START,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thus, we start collecting data 1 min after the start lifecycle of kibana_usage_collection plugin.

Copy link
Member Author

@Bamieh Bamieh Jun 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The monitoring of the event loop starts immediately. The first collection of the histogram happens after 1 minute. The daily histogram data is updated every 1 hour.

}

export function serializeSavedObjectId({ date, pid }: { date: moment.MomentInput; pid: number }) {
const formattedDate = moment(date).format('DDMMYYYY');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same comment about moment usage when the standard library should do the trick:

const date = new Date();
console.log(`${date.getDate()}_${date.getMonth()}_${date.getFullYear()}`);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer using moment for dates especially when it comes to formatting etc. .format('DDMMYYYY') is a lot more readable and less error prone than the suggestion above IMO.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is a lot more readable

I do agree it's more readable, I am just not sure that using moment for such simple functionality justifies the performance penalty. Yeah, it's not significant per se, but it's a road to death by a thousand cuts.

Copy link
Member Author

@Bamieh Bamieh Jun 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm generally against focusing on micro optimizations. There are a lot of bigger performance bottlenecks in an i/o project like kibana.

I dont think kibana would ever suffer in performance because JS can execute a moment function 200k times per second while a native date would execute 800k times. I'd always favor readablity and maintanablity in these cases. Especially since we have moment integrated in every place where we use dates even in eui components.

I did however replace moment.now() with Date.now() following your advice

src/plugins/kibana_usage_collection/server/plugin.ts Outdated Show resolved Hide resolved
Comment on lines +96 to +98
eventLoopDelaysCollector.reset();
}
await storeHistogram(histogram, internalRepository);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: technically, being inside a subscription, I think this await does nothing. Not sure it would even protect against unresolved rejections?

Copy link
Member Author

@Bamieh Bamieh Jun 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically it is exactly the same as without await in this case since it is inside a .subscribe(..). I've just added it to explicitly denote that a Promise is expected here, also if we want to add some logic in the future under this function it might be easy to miss adding await

src/plugins/kibana_usage_collection/server/plugin.ts Outdated Show resolved Hide resolved
/**
* Roll daily indices every 24h
*/
export const ROLL_DAILY_INDICES_INTERVAL = 24 * 60 * 60 * 1000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional nit: it's easy to forget updating the comment, so self-docuemented code FTW:

const Min = 60 * 1000;
const Hour = 60 * Min;
/** * Start rolling indices */
export const ROLL_DAILY_INDICES_INTERVAL = 24 * Hour
export const ROLL_INDICES_START = 5 * Min;

@lukeelmers
Copy link
Member

Was trying to think if there was any way this could break with the plan for clustering, but AFAICT we should be fine as long as the pid is always reported with the metrics.

@Bamieh Bamieh added the auto-backport Deprecated: Automatically backport this PR after it's merged label Jun 21, 2021
@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

Saved Objects .kibana field count

Every field in each saved object type adds overhead to Elasticsearch. Kibana needs to keep the total field count below Elasticsearch's default limit of 1000 fields. Only specify field mappings for the fields you wish to search on or query. See https://www.elastic.co/guide/en/kibana/master/development-plugin-saved-objects.html#_mappings

id before after diff
event_loop_delays_daily - 2 +2

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@Bamieh Bamieh merged commit 42fc797 into elastic:master Jun 22, 2021
@Bamieh Bamieh deleted the telemetry/event_loop_delays branch June 22, 2021 03:21
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Jun 22, 2021
@kibanamachine
Copy link
Contributor

💚 Backport successful

Status Branch Result
7.x

This backport PR will be merged automatically after passing CI.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Deprecated: Automatically backport this PR after it's merged Feature:Telemetry release_note:skip Skip the PR/issue when compiling release notes v7.14.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants