Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-15059: Improve query performance monitoring #2165

Merged
merged 12 commits into from Jan 7, 2021

Conversation

thelabdude
Copy link
Contributor

@thelabdude thelabdude commented Dec 23, 2020

Description

See JIRA: https://issues.apache.org/jira/browse/SOLR-15059 ... see a detailed description in the JIRA

Solution

Improve the Grafana dashboard to include graphs for monitoring query performance.

Tests

Manual testing of the Grafana dashboard in the browser while running query load.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the master branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Ref Guide (for Solr changes only).

@MarcusSorealheis
Copy link
Contributor

Is stripWs a term of art? Its definition is semi-obvious but will be even more obvious with a new name. The test files will live a long time and this PR is a key addition.

@epugh
Copy link
Contributor

epugh commented Dec 23, 2020

How does this compare/overlap with any of the dashboards on Grafana's site? https://grafana.com/grafana/dashboards?search=solr

Specifically, the one published by @janhoy at https://grafana.com/grafana/dashboards/12456?

Be nice if we had one widely used and supported Grafana dashboard!

@thelabdude
Copy link
Contributor Author

@epugh If you diff the dashboard you linked to with the version in master (https://github.com/apache/lucene-solr/blob/master/solr/contrib/prometheus-exporter/conf/grafana-solr-dashboard.json) you'll see they are the same.

This PR just enhances that dashboard with query performance metrics and changes a few of the defaults in how null / zero series are handled.

@@ -105,17 +108,23 @@ public static MetricsConfiguration from(String resource) throws Exception {
public static MetricsConfiguration from(Document config) throws Exception {
Node settings = getNode(config, "/config/settings");

Map<String,MetricsQueryTemplate> jqTemplatesMap = null;
NodeList jqTemplates = (NodeList)(xpathFactory.newXPath()).evaluate("/config/jq-templates/template", config, XPathConstants.NODESET);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noble just spent a bunch of effort getting rid of XPath in other places, is this a good direction to be going now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code is already using XPath, I'm not introducing it here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this happens once during initialization of the Prom exporter, so efficiency of XPath isn't so much of a concern. A few extra millis (if that much) during init doesn't seem like a problem to me, esp. if it makes the code cleaner.

// could be a simple field name or some kind of function here
if (!metric.contains("$")) {
if ("object.value".equals(metric)) {
metric = "$object.value"; // don't require the user to supply the $
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we trying to be too helpful here? Or is this matching some existing spec?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just trying to be helpful to keep the template syntax simpler ... e.g. I like {{count}} vs. {{$object.value.count}}

@thelabdude
Copy link
Contributor Author

Is stripWs a term of art? Its definition is semi-obvious but will be even more obvious with a new name. The test files will live a long time and this PR is a key addition.

@MarcusSorealheis Updated the name ;-)

@thelabdude thelabdude requested a review from janhoy January 5, 2021 22:18
@thelabdude
Copy link
Contributor Author

Pinged you for a review @janhoy if you're curious. Also, would appreciate guidance on updating the default dashboard with these changes once they are back ported to 8.8. Thanks in advance ;-)

@thelabdude thelabdude merged commit 8b55fb8 into apache:master Jan 7, 2021
thelabdude added a commit to thelabdude/lucene-solr that referenced this pull request Jan 7, 2021
ctargett pushed a commit to ctargett/lucene-solr that referenced this pull request Jan 11, 2021
epugh pushed a commit to epugh/lucene-solr-1 that referenced this pull request Jan 15, 2021
label_values : [$category, $handler],
value : $value
}
$jq:node(requests_total, select(.key | endswith(".local.requestTimes")), count)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thelabdude I noticed here you added a ".local." when it wasn't there before. Why? And for that matter, maybe we needn't bother publishing this particular metric at all; I'm skeptical of the utility.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dsmiley it's to support the query charts showing core-level query metrics vs. top-level distributed query metrics added in this PR. I like knowing if there's an imbalance of core-level query requests going to certain replicas or if the load across all of my replicas is balanced. So you're skeptical but you haven't said why exactly? You want to change, then change it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw ~ have you actually looked at the charts added in this PR in Grafana with query load running? If there's a problem there, then let's fix it and move forward but rehashing old decisions seems unproductive at this point.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant to comment on the "totalTime" metric w.r.t. it's usefulness; sorry for the confusion. It's some massive number of course... it'd need to be divided by something else to be useful? Also, totalTime is in nanoseconds lately! https://issues.apache.org/jira/browse/SOLR-16073

I understand the overarching objective of top-level vs core-level -- makes sense.
I'm a bit unclear on the distinction between the node level "$jq:node" metrics, and the "Local (non-distrib) query metrics", both of which are using ".local.".

RE Grafana; I haven't seen our official one in use live. I use our own/custom one at work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants