
[Rollup] Add more diagnostic stats to job #35471

Merged
5 commits merged into elastic:master on Nov 27, 2018

Conversation

polyfractal
Contributor

To help debug future performance issues, this adds the min/max/avg/count/total latencies (in milliseconds) for the search and bulk phases. This latency is the total service time, including transfer between nodes, not just the `took` time.

It also adds the count of search/bulk failures encountered during runtime. This information is also in the log, but a runtime counter will help expose problems faster.

Also updates the HLRC with the new response elements.

/cc @hendrikmuhs This adds the stats to the IndexerJobStats superclass, although all the xContent handling is done in the Rollup implementation.
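
A minimal sketch of the kind of accumulator described above, assuming hypothetical class, field, and method names rather than the actual IndexerJobStats code (only markStartSearch() and incrementSearchFailures() appear in the diffs below; the rest is illustrative):

// Hedged sketch only, not the actual IndexerJobStats implementation.
// It accumulates total service time and failure counts for the search phase;
// the bulk/index phase would mirror it.
public class IndexerStatsSketch {
    private long searchTimeInMs = 0;   // total wall-clock time spent in search requests
    private long numSearches = 0;      // number of search requests issued
    private long searchFailures = 0;   // number of search requests that failed
    private long searchStartNanos = -1;

    public synchronized void markStartSearch() {
        searchStartNanos = System.nanoTime();
    }

    public synchronized void markEndSearch() {
        // service time includes node-to-node transfer, not just the search `took` value
        searchTimeInMs += (System.nanoTime() - searchStartNanos) / 1_000_000;
        numSearches += 1;
        searchStartNanos = -1;
    }

    public synchronized void incrementSearchFailures() {
        searchFailures += 1;
    }
}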

@polyfractal added the >enhancement, v7.0.0, :StorageEngine/Rollup, and v6.6.0 labels on Nov 12, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

static final ParseField BULK_LATENCY = new ParseField("bulk_latency_in_ms");
static final ParseField SEARCH_LATENCY = new ParseField("search_latency_in_ms");
static final ParseField SEARCH_FAILURES = new ParseField("search_failures");
static final ParseField BULK_FAILURES = new ParseField("bulk_failures");
Contributor

Nit: I think it would be nicer to call it INDEX_FAILURES; BULK is an implementation detail of how indexing is handled internally.

Contributor Author

++
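
For context, a hedged sketch of how ParseField constants like the ones above typically feed xContent output; the class below is illustrative only, not the actual Rollup stats implementation, and the import paths assume the 6.x package layout:

import java.io.IOException;

import org.elasticsearch.common.ParseField;
import org.elasticsearch.common.xcontent.ToXContentObject;
import org.elasticsearch.common.xcontent.XContentBuilder;

// Sketch only: emits the failure counters under their ParseField names,
// using the index_failures naming suggested in the review above.
class FailureStatsSketch implements ToXContentObject {
    static final ParseField SEARCH_FAILURES = new ParseField("search_failures");
    static final ParseField INDEX_FAILURES = new ParseField("index_failures");

    private final long searchFailures;
    private final long indexFailures;

    FailureStatsSketch(long searchFailures, long indexFailures) {
        this.searchFailures = searchFailures;
        this.indexFailures = indexFailures;
    }

    @Override
    public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
        builder.startObject();
        // getPreferredName() returns the string passed to the ParseField constructor
        builder.field(SEARCH_FAILURES.getPreferredName(), searchFailures);
        builder.field(INDEX_FAILURES.getPreferredName(), indexFailures);
        builder.endObject();
        return builder;
    }
}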

@hendrikmuhs
Contributor

nice addition!

Contributor

@jimczi jimczi left a comment

The change looks good @polyfractal. I left some comments.

static final ParseField MAX = new ParseField("max");
static final ParseField AVG = new ParseField("avg");
static final ParseField COUNT = new ParseField("count");
static final ParseField TOTAL = new ParseField("total");
Contributor

To be consistent with the _stats API, can we call these bulk_time_in_millis and query_time_in_millis? I am also not sure we need the min, the max, and the avg. Wouldn't the total time spent in these operations and the number of calls per action be enough?

Contributor

@polyfractal Are MIN, ..., ..., TOTAL leftovers from previous iterations? They look unused to me.

Contributor Author

Right you are!

"trigger_count" : 0
"trigger_count" : 0,
"bulk_failures": 0,
"bulk_latency_in_ms": {
Contributor

Can we simplify this to:

"bulk_time_in_ms": 0,
"bulk_total": 0,
"search_time_in_ms": 0,
"search_total": 0

?
I don't think we need more than the total time and the number of invocations.

Contributor Author

Sure, I can simplify these. :)

@@ -153,6 +153,7 @@ public synchronized boolean maybeTriggerAsyncJob(long now) {
// fire off the search. Note this is async, the method will return from here
executor.execute(() -> {
try {
stats.markStartSearch();
doNextSearch(buildSearchRequest(), ActionListener.wrap(this::onSearchResponse, exc -> finishWithFailure(exc)));
} catch (Exception e) {
Contributor

Missing `stats.incrementSearchFailures()`?

Contributor Author

++ good catch
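
A hedged sketch of what the corrected block could look like; the call to finishWithFailure inside the catch and the exact placement of the counter increment in the merged code are assumptions:

executor.execute(() -> {
    try {
        stats.markStartSearch();
        doNextSearch(buildSearchRequest(), ActionListener.wrap(this::onSearchResponse, exc -> finishWithFailure(exc)));
    } catch (Exception e) {
        // count the synchronous failure before handing it to the failure path
        stats.incrementSearchFailures();
        finishWithFailure(e);
    }
});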

Contributor

@hendrikmuhs hendrikmuhs left a comment

LGTM

@polyfractal polyfractal merged commit 48fa251 into elastic:master Nov 27, 2018
@polyfractal
Contributor Author

Thanks @hendrikmuhs @jimczi!

polyfractal added a commit that referenced this pull request Nov 27, 2018
This adds some new statistics to the job to help with debugging
performance issues:

- Total search and index time (in milliseconds) encountered by the indexer
during runtime.  This time is the total service time including
transfer between nodes, not just the `took` time.

- Total count of search and index requests.  Together with the total
times, this can be used to determine average request time.

- Count of search/bulk failures encountered during
runtime.  This information is also in the log, but a runtime counter
will help expose problems faster.
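
As a hedged illustration of how the new counters combine (the parameter names below are generic, not accessors from the actual stats class), the average request time falls out of the totals directly:

// Sketch only: average request latency derived from total time and request count.
static double avgLatencyMs(long totalTimeInMs, long requestCount) {
    return requestCount == 0 ? 0.0 : (double) totalTimeInMs / requestCount;
}
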
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Nov 28, 2018
* master:
  DOCS Audit event attributes in new format (elastic#35510)
  Scripting: Actually add joda time back to whitelist (elastic#35965)
  [DOCS] fix HLRC ILM doc misreferenced tag
  Add realm information for Authenticate API (elastic#35648)
  [ILM] add HLRC docs to remove-policy-from-index (elastic#35759)
  [Rollup] Update serialization version after backport
  [Rollup] Add more diagnostic stats to job (elastic#35471)
  Build: Fix gradle build for Mac OS (elastic#35968)
  Adds deprecation logging to ScriptDocValues#getValues. (elastic#34279)
  [Monitoring] Make Exporters Async (elastic#35765)
  [ILM] reduce time restriction on IndexLifecycleExplainResponse (elastic#35954)
  Remove use of AbstractComponent in xpack (elastic#35394)
  Deprecate types in search and multi search templates. (elastic#35669)
  Remove fromXContent from IndexUpgradeInfoResponse (elastic#35934)
Labels
>enhancement, :StorageEngine/Rollup, v6.6.0, v7.0.0-beta1