New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SOLR-15059: Improve query performance monitoring #2165
Conversation
Is |
How does this compare/overlap with any of the dashboards on Grafana's site? https://grafana.com/grafana/dashboards?search=solr Specifically, the one published by @janhoy at https://grafana.com/grafana/dashboards/12456? Be nice if we had one widely used and supported Grafana dashboard! |
@epugh If you diff the dashboard you linked to with the version in master (https://github.com/apache/lucene-solr/blob/master/solr/contrib/prometheus-exporter/conf/grafana-solr-dashboard.json) you'll see they are the same. This PR just enhances that dashboard with query performance metrics and changes a few of the defaults in how null / zero series are handled. |
...b/prometheus-exporter/src/java/org/apache/solr/prometheus/exporter/MetricsQueryTemplate.java
Show resolved
Hide resolved
@@ -105,17 +108,23 @@ public static MetricsConfiguration from(String resource) throws Exception { | |||
public static MetricsConfiguration from(Document config) throws Exception { | |||
Node settings = getNode(config, "/config/settings"); | |||
|
|||
Map<String,MetricsQueryTemplate> jqTemplatesMap = null; | |||
NodeList jqTemplates = (NodeList)(xpathFactory.newXPath()).evaluate("/config/jq-templates/template", config, XPathConstants.NODESET); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noble just spent a bunch of effort getting rid of XPath in other places, is this a good direction to be going now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code is already using XPath, I'm not introducing it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this happens once during initialization of the Prom exporter, so efficiency of XPath isn't so much of a concern. A few extra millis (if that much) during init doesn't seem like a problem to me, esp. if it makes the code cleaner.
...b/prometheus-exporter/src/java/org/apache/solr/prometheus/exporter/MetricsConfiguration.java
Outdated
Show resolved
Hide resolved
solr/contrib/prometheus-exporter/src/java/org/apache/solr/prometheus/exporter/MetricsQuery.java
Show resolved
Hide resolved
...b/prometheus-exporter/src/java/org/apache/solr/prometheus/exporter/MetricsConfiguration.java
Show resolved
Hide resolved
...b/prometheus-exporter/src/java/org/apache/solr/prometheus/exporter/MetricsConfiguration.java
Outdated
Show resolved
Hide resolved
// could be a simple field name or some kind of function here | ||
if (!metric.contains("$")) { | ||
if ("object.value".equals(metric)) { | ||
metric = "$object.value"; // don't require the user to supply the $ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we trying to be too helpful here? Or is this matching some existing spec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just trying to be helpful to keep the template syntax simpler ... e.g. I like {{count}} vs. {{$object.value.count}}
...b/prometheus-exporter/src/java/org/apache/solr/prometheus/exporter/MetricsQueryTemplate.java
Outdated
Show resolved
Hide resolved
...b/prometheus-exporter/src/java/org/apache/solr/prometheus/exporter/MetricsQueryTemplate.java
Outdated
Show resolved
Hide resolved
@MarcusSorealheis Updated the name ;-) |
… add a panel showing number of leaders per node
...b/prometheus-exporter/src/java/org/apache/solr/prometheus/exporter/MetricsQueryTemplate.java
Outdated
Show resolved
Hide resolved
Pinged you for a review @janhoy if you're curious. Also, would appreciate guidance on updating the default dashboard with these changes once they are back ported to 8.8. Thanks in advance ;-) |
label_values : [$category, $handler], | ||
value : $value | ||
} | ||
$jq:node(requests_total, select(.key | endswith(".local.requestTimes")), count) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thelabdude I noticed here you added a ".local." when it wasn't there before. Why? And for that matter, maybe we needn't bother publishing this particular metric at all; I'm skeptical of the utility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dsmiley it's to support the query charts showing core-level query metrics vs. top-level distributed query metrics added in this PR. I like knowing if there's an imbalance of core-level query requests going to certain replicas or if the load across all of my replicas is balanced. So you're skeptical but you haven't said why exactly? You want to change, then change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fwiw ~ have you actually looked at the charts added in this PR in Grafana with query load running? If there's a problem there, then let's fix it and move forward but rehashing old decisions seems unproductive at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant to comment on the "totalTime" metric w.r.t. it's usefulness; sorry for the confusion. It's some massive number of course... it'd need to be divided by something else to be useful? Also, totalTime is in nanoseconds lately! https://issues.apache.org/jira/browse/SOLR-16073
I understand the overarching objective of top-level vs core-level -- makes sense.
I'm a bit unclear on the distinction between the node level "$jq:node" metrics, and the "Local (non-distrib) query metrics", both of which are using ".local.".
RE Grafana; I haven't seen our official one in use live. I use our own/custom one at work.
Description
See JIRA: https://issues.apache.org/jira/browse/SOLR-15059 ... see a detailed description in the JIRA
Solution
Improve the Grafana dashboard to include graphs for monitoring query performance.
Tests
Manual testing of the Grafana dashboard in the browser while running query load.
Checklist
Please review the following and check all that apply:
master
branch../gradlew check
.