SOLR-15059: Improve query performance monitoring #2165

thelabdude · 2020-12-23T17:22:28Z

Description

See the JIRA issue for a detailed description: https://issues.apache.org/jira/browse/SOLR-15059

Solution

Improve the Grafana dashboard to include graphs for monitoring query performance.

Tests

Manual testing of the Grafana dashboard in the browser while running query load.

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
I have developed this patch against the master branch.
I have run ./gradlew check.
I have added tests for my changes.
I have added documentation for the Ref Guide (for Solr changes only).

MarcusSorealheis · 2020-12-23T17:38:30Z

Is stripWs a term of art? Its definition is semi-obvious but will be even more obvious with a new name. The test files will live a long time and this PR is a key addition.

epugh · 2020-12-23T18:20:44Z

How does this compare/overlap with any of the dashboards on Grafana's site? https://grafana.com/grafana/dashboards?search=solr

Specifically, the one published by @janhoy at https://grafana.com/grafana/dashboards/12456?

Be nice if we had one widely used and supported Grafana dashboard!

thelabdude · 2020-12-23T18:35:49Z

@epugh If you diff the dashboard you linked to with the version in master (https://github.com/apache/lucene-solr/blob/master/solr/contrib/prometheus-exporter/conf/grafana-solr-dashboard.json) you'll see they are the same.

This PR just enhances that dashboard with query performance metrics and changes a few of the defaults in how null / zero series are handled.

madrob · 2020-12-23T17:36:01Z

...b/prometheus-exporter/src/java/org/apache/solr/prometheus/exporter/MetricsConfiguration.java

@@ -105,17 +108,23 @@ public static MetricsConfiguration from(String resource) throws Exception {
  public static MetricsConfiguration from(Document config) throws Exception {
    Node settings = getNode(config, "/config/settings");

+    Map<String,MetricsQueryTemplate> jqTemplatesMap = null;
+    NodeList jqTemplates = (NodeList)(xpathFactory.newXPath()).evaluate("/config/jq-templates/template", config, XPathConstants.NODESET);


Noble just spent a bunch of effort getting rid of XPath in other places, is this a good direction to be going now?

The code is already using XPath, I'm not introducing it here?

Also, this happens once during initialization of the Prom exporter, so efficiency of XPath isn't so much of a concern. A few extra millis (if that much) during init doesn't seem like a problem to me, esp. if it makes the code cleaner.
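For readers following the thread, the one-time XPath lookup discussed above can be sketched roughly as follows. This is a minimal illustration and not the PR's actual code: the class `JqTemplateLoader`, its `load` method, and the `name` attribute on each `template` element are assumptions made for the example; only the XPath expression is taken from the diff.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class JqTemplateLoader {
  // Hypothetical helper: maps each template's "name" attribute (assumed) to its text body.
  static Map<String, String> load(String xml) throws Exception {
    Document config = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)));
    // Same XPath expression as in the diff; evaluated once, at startup.
    NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
        .evaluate("/config/jq-templates/template", config, XPathConstants.NODESET);
    Map<String, String> templates = new LinkedHashMap<>();
    for (int i = 0; i < nodes.getLength(); i++) {
      Element el = (Element) nodes.item(i);
      templates.put(el.getAttribute("name"), el.getTextContent().trim());
    }
    return templates;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<config><jq-templates>"
        + "<template name=\"core\">.metrics | to_entries</template>"
        + "</jq-templates></config>";
    System.out.println(load(xml));
  }
}
```

Because the lookup runs only while the configuration is parsed, its cost is paid once per exporter start rather than per scrape, which is the point made in the reply above.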

madrob · 2020-12-23T18:35:05Z

...b/prometheus-exporter/src/java/org/apache/solr/prometheus/exporter/MetricsQueryTemplate.java

+    // could be a simple field name or some kind of function here
+    if (!metric.contains("$")) {
+      if ("object.value".equals(metric)) {
+        metric = "$object.value"; // don't require the user to supply the $


Are we trying to be too helpful here? Or is this matching some existing spec?

just trying to be helpful to keep the template syntax simpler ... e.g. I like {{count}} vs. {{$object.value.count}}
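The shorthand described in this exchange could look roughly like the sketch below. The first two branches mirror the diff above; the final branch, which prefixes any other bare name with `$object.value.`, is an assumption inferred from the `{{count}}` vs. `{{$object.value.count}}` remark. The class and method names are hypothetical.

```java
class MetricRef {
  // Sketch: expand a shorthand metric reference to the full jq variable path.
  static String normalize(String metric) {
    if (metric.contains("$")) {
      return metric; // already an explicit jq expression; leave untouched
    }
    if ("object.value".equals(metric)) {
      return "$object.value"; // same special case as in the diff
    }
    // Assumption: any other bare name is treated as a field of $object.value.
    return "$object.value." + metric;
  }

  public static void main(String[] args) {
    System.out.println(normalize("count"));
    System.out.println(normalize("object.value"));
    System.out.println(normalize("$object.value.mean"));
  }
}
```

The trade-off raised in the review still applies: the shorthand keeps templates short, but a reader of the config has to know that the prefix is added implicitly.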

thelabdude · 2020-12-23T19:26:55Z

Is stripWs a term of art? Its definition is semi-obvious but will be even more obvious with a new name. The test files will live a long time and this PR is a key addition.

@MarcusSorealheis Updated the name ;-)

thelabdude · 2021-01-05T22:19:43Z

Pinged you for a review @janhoy if you're curious. Also, would appreciate guidance on updating the default dashboard with these changes once they are backported to 8.8. Thanks in advance ;-)

dsmiley · 2022-03-03T05:05:57Z

solr/contrib/prometheus-exporter/conf/solr-exporter-config.xml

-              label_values : [$category, $handler],
-              value        : $value
-            }
+            $jq:node(requests_total, select(.key | endswith(".local.requestTimes")), count)


@thelabdude I noticed here you added a ".local." when it wasn't there before. Why? And for that matter, maybe we needn't bother publishing this particular metric at all; I'm skeptical of the utility.

@dsmiley it's to support the query charts showing core-level query metrics vs. top-level distributed query metrics added in this PR. I like knowing if there's an imbalance of core-level query requests going to certain replicas or if the load across all of my replicas is balanced. So you're skeptical but you haven't said why exactly? You want to change, then change it.

fwiw ~ have you actually looked at the charts added in this PR in Grafana with query load running? If there's a problem there, then let's fix it and move forward but rehashing old decisions seems unproductive at this point.
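To make the effect of the `.local.` suffix concrete, here is a small sketch of the same suffix test that the jq `endswith` selector in the diff performs, written in Java. The sample keys are illustrative assumptions about what a handler's metric keys might look like, and `RequestTimeKeys` is a hypothetical name; this is not how the exporter itself filters keys.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class RequestTimeKeys {
  // Keep only the metric keys that end with the given suffix.
  static List<String> select(List<String> keys, String suffix) {
    return keys.stream()
        .filter(k -> k.endsWith(suffix))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Illustrative keys (assumed): top-level, core-level, and distributed timers.
    List<String> keys = Arrays.asList(
        "QUERY./select.requestTimes",
        "QUERY./select.local.requestTimes",
        "QUERY./select.distrib.requestTimes");
    System.out.println(select(keys, ".requestTimes"));
    System.out.println(select(keys, ".local.requestTimes"));
  }
}
```

Under these assumed keys, the shorter suffix matches all three timers, while `.local.requestTimes` narrows the selection to the core-level one, which is what the per-replica balance charts need.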

I meant to comment on the "totalTime" metric w.r.t. its usefulness; sorry for the confusion. It's some massive number of course... it'd need to be divided by something else to be useful? Also, totalTime is in nanoseconds lately! https://issues.apache.org/jira/browse/SOLR-16073

I understand the overarching objective of top-level vs core-level -- makes sense.
I'm a bit unclear on the distinction between the node level "$jq:node" metrics, and the "Local (non-distrib) query metrics", both of which are using ".local.".

RE Grafana; I haven't seen our official one in use live. I use our own/custom one at work.
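One way to read the "divided by something else" remark about totalTime: take two samples of the cumulative request count and cumulative totalTime, and divide the change in time by the change in count to get a mean time per request over that interval. The sketch below assumes totalTime is reported in nanoseconds, as the comment above suggests; the class and parameter names are hypothetical and this is not part of the PR.

```java
class MeanRequestTime {
  // Mean request time in milliseconds between two samples of cumulative counters.
  // Assumption: totalTime values are in nanoseconds.
  static double meanMillis(long count0, long totalNanos0, long count1, long totalNanos1) {
    long requests = count1 - count0;
    if (requests <= 0) {
      return 0.0; // no new requests in the interval (or a counter reset)
    }
    return (totalNanos1 - totalNanos0) / (double) requests / 1_000_000.0;
  }

  public static void main(String[] args) {
    // 50 new requests taking 250,000,000 ns in total.
    System.out.println(meanMillis(100, 5_000_000_000L, 150, 5_250_000_000L));
  }
}
```

If the metric were reported in milliseconds instead, the final division by 1,000,000 would have to be dropped, so the unit needs to be confirmed for the Solr version in use.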

SOLR-15059: Improve query performance monitoring

b32cb0e

Add unit test for extracting query metrics

0b4aab8

Back to [1m] for increase, inadvertently changed to [5m]

1e92c76

madrob reviewed Dec 23, 2020

thelabdude added 2 commits December 23, 2020 11:45

Fix broken test and remove unused getters

5c49dd2

Use StringUtils

f5a9143

thelabdude added 4 commits January 5, 2021 13:27

Merge remote-tracking branch 'asf/master' into SOLR-15059

98e3337

Update the key selectors for core-query template to be more clear and add a panel showing number of leaders per node

507656e

Update test to reflect changes to templates

f4883ae

Remove auto-decoration around the KEYSELECTOR clause

34914b5

sonatype-lift bot reviewed Jan 5, 2021

...b/prometheus-exporter/src/java/org/apache/solr/prometheus/exporter/MetricsQueryTemplate.java (outdated, resolved)

thelabdude requested a review from janhoy January 5, 2021 22:18

thelabdude added 3 commits January 5, 2021 15:21

Fix musedev issue

7a759ef

Merge remote-tracking branch 'asf/master' into SOLR-15059

1494cab

Update changes

0a580f8

thelabdude merged commit 8b55fb8 into apache:master Jan 7, 2021

thelabdude added a commit to thelabdude/lucene-solr that referenced this pull request Jan 7, 2021

SOLR-15059: Improve query performance monitoring (apache#2165)

defbdcc

thelabdude mentioned this pull request Jan 7, 2021

SOLR-15059: Improve query performance monitoring #2184

Merged

ctargett pushed a commit to ctargett/lucene-solr that referenced this pull request Jan 11, 2021

SOLR-15059: Improve query performance monitoring (apache#2165)

d09e8d9

epugh pushed a commit to epugh/lucene-solr-1 that referenced this pull request Jan 15, 2021

SOLR-15059: Improve query performance monitoring (apache#2165)

aa6c7cf

ctargett pushed a commit that referenced this pull request Jan 15, 2021

SOLR-15059: Improve query performance monitoring (#2165)

37080b9

dsmiley reviewed Mar 3, 2022
