
[Rollup] Add more diagnostic stats to job #35471

Merged
5 commits merged into elastic:master on Nov 27, 2018

Conversation

polyfractal
Contributor

To help debug future performance issues, this adds the min/max/avg/count/total latencies (in milliseconds) for the search and bulk phases. This latency is the total service time, including transfer between nodes, not just the `took` time.

It also adds the count of search/bulk failures encountered during runtime. This information is also in the log, but a runtime counter will help expose problems faster.

Also updates the HLRC with the new response elements.

/cc @hendrikmuhs This adds the stats to the IndexerJobStats superclass, although all the xContent handling is done in the Rollup implementation.
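
A minimal sketch of the kind of accumulator described above, assuming hypothetical class, field, and method names rather than the actual IndexerJobStats code (only markStartSearch() and incrementSearchFailures() appear in the diffs below; the rest is illustrative):

// Hedged sketch only, not the actual IndexerJobStats implementation.
// It accumulates total service time and failure counts for the search phase;
// the bulk/index phase would mirror it.
public class IndexerStatsSketch {
    private long searchTimeInMs = 0;   // total wall-clock time spent in search requests
    private long numSearches = 0;      // number of search requests issued
    private long searchFailures = 0;   // number of search requests that failed
    private long searchStartNanos = -1;

    public synchronized void markStartSearch() {
        searchStartNanos = System.nanoTime();
    }

    public synchronized void markEndSearch() {
        // service time includes node-to-node transfer, not just the search `took` value
        searchTimeInMs += (System.nanoTime() - searchStartNanos) / 1_000_000;
        numSearches += 1;
        searchStartNanos = -1;
    }

    public synchronized void incrementSearchFailures() {
        searchFailures += 1;
    }
}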

@polyfractal added the >enhancement, v7.0.0, :StorageEngine/Rollup, and v6.6.0 labels on Nov 12, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

static final ParseField BULK_LATENCY = new ParseField("bulk_latency_in_ms");
static final ParseField SEARCH_LATENCY = new ParseField("search_latency_in_ms");
static final ParseField SEARCH_FAILURES = new ParseField("search_failures");
static final ParseField BULK_FAILURES = new ParseField("bulk_failures");
Contributor

Nit: I think it would be nicer to call it INDEX_FAILURES; BULK is an implementation detail of how indexing is handled internally.

Contributor Author

++
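
For context, a hedged sketch of how ParseField constants like the ones above typically feed xContent output; the class below is illustrative only, not the actual Rollup stats implementation, and the import paths assume the 6.x package layout:

import java.io.IOException;

import org.elasticsearch.common.ParseField;
import org.elasticsearch.common.xcontent.ToXContentObject;
import org.elasticsearch.common.xcontent.XContentBuilder;

// Sketch only: emits the failure counters under their ParseField names,
// using the index_failures naming suggested in the review above.
class FailureStatsSketch implements ToXContentObject {
    static final ParseField SEARCH_FAILURES = new ParseField("search_failures");
    static final ParseField INDEX_FAILURES = new ParseField("index_failures");

    private final long searchFailures;
    private final long indexFailures;

    FailureStatsSketch(long searchFailures, long indexFailures) {
        this.searchFailures = searchFailures;
        this.indexFailures = indexFailures;
    }

    @Override
    public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
        builder.startObject();
        // getPreferredName() returns the string passed to the ParseField constructor
        builder.field(SEARCH_FAILURES.getPreferredName(), searchFailures);
        builder.field(INDEX_FAILURES.getPreferredName(), indexFailures);
        builder.endObject();
        return builder;
    }
}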

@hendrikmuhs
Contributor

nice addition!

Contributor

@jimczi jimczi left a comment

The change looks good @polyfractal. I left some comments.

static final ParseField MAX = new ParseField("max");
static final ParseField AVG = new ParseField("avg");
static final ParseField COUNT = new ParseField("count");
static final ParseField TOTAL = new ParseField("total");
Contributor

To be consistent with the _stats API, can we call these bulk_time_in_millis and query_time_in_millis? I am also not sure we need the min, the max, and the avg. Wouldn't the total time spent in these operations and the number of calls per action be enough?

Contributor

@polyfractal Are MIN, ..., ..., TOTAL leftovers from previous iterations? They look unused to me.

Contributor Author

Right you are!

"trigger_count" : 0
"trigger_count" : 0,
"bulk_failures": 0,
"bulk_latency_in_ms": {
Contributor

Can we simplify this to:

"bulk_time_in_ms": 0,
"bulk_total": 0,
"search_time_in_ms": 0,
"search_total": 0

?
I don't think we need more than the total time and the number of invocations.

Contributor Author

Sure, I can simplify these. :)

@@ -153,6 +153,7 @@ public synchronized boolean maybeTriggerAsyncJob(long now) {
// fire off the search. Note this is async, the method will return from here
executor.execute(() -> {
try {
stats.markStartSearch();
doNextSearch(buildSearchRequest(), ActionListener.wrap(this::onSearchResponse, exc -> finishWithFailure(exc)));
} catch (Exception e) {
Contributor

Missing `stats.incrementSearchFailures()`?

Contributor Author

++ good catch
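
A hedged sketch of what the corrected block could look like; the call to finishWithFailure inside the catch and the exact placement of the counter increment in the merged code are assumptions:

executor.execute(() -> {
    try {
        stats.markStartSearch();
        doNextSearch(buildSearchRequest(), ActionListener.wrap(this::onSearchResponse, exc -> finishWithFailure(exc)));
    } catch (Exception e) {
        // count the synchronous failure before handing it to the failure path
        stats.incrementSearchFailures();
        finishWithFailure(e);
    }
});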

Contributor

@hendrikmuhs hendrikmuhs left a comment

LGTM

@polyfractal polyfractal merged commit 48fa251 into elastic:master Nov 27, 2018
@polyfractal
Contributor Author

Thanks @hendrikmuhs @jimczi!

polyfractal added a commit that referenced this pull request Nov 27, 2018
This adds some new statistics to the job to help with debugging
performance issues:

- Total search and index time (in milliseconds) encountered by the indexer
during runtime.  This time is the total service time including
transfer between nodes, not just the `took` time.

- Total count of search and index requests.  Together with the total
times, this can be used to determine average request time.

- Count of search/bulk failures encountered during
runtime.  This information is also in the log, but a runtime counter
will help expose problems faster.
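
As a hedged illustration of how the new counters combine (the parameter names below are generic, not accessors from the actual stats class), the average request time falls out of the totals directly:

// Sketch only: average request latency derived from total time and request count.
static double avgLatencyMs(long totalTimeInMs, long requestCount) {
    return requestCount == 0 ? 0.0 : (double) totalTimeInMs / requestCount;
}
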
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Nov 28, 2018
* master:
  DOCS Audit event attributes in new format (elastic#35510)
  Scripting: Actually add joda time back to whitelist (elastic#35965)
  [DOCS] fix HLRC ILM doc misreferenced tag
  Add realm information for Authenticate API (elastic#35648)
  [ILM] add HLRC docs to remove-policy-from-index (elastic#35759)
  [Rollup] Update serialization version after backport
  [Rollup] Add more diagnostic stats to job (elastic#35471)
  Build: Fix gradle build for Mac OS (elastic#35968)
  Adds deprecation logging to ScriptDocValues#getValues. (elastic#34279)
  [Monitoring] Make Exporters Async (elastic#35765)
  [ILM] reduce time restriction on IndexLifecycleExplainResponse (elastic#35954)
  Remove use of AbstractComponent in xpack (elastic#35394)
  Deprecate types in search and multi search templates. (elastic#35669)
  Remove fromXContent from IndexUpgradeInfoResponse (elastic#35934)
Labels
>enhancement, :StorageEngine/Rollup, v6.6.0, v7.0.0-beta1