SOLR-14401: Track distrib/shard metrics differently #657
dsmiley merged 19 commits into apache:main
* only do for SearchHandler, not all request handlers
* track all the same details at the shard level as request (more metrics)
* don't limit this to SolrCloud
HoustonPutman
left a comment
These changes look good to me. I don't have any particular opinion on the new naming, I see pros and cons, but if it works, it works.
Obviously we need some documentation around this, and I'm assuming that's coming next.
I think we need to change the default Prometheus exporter config and Grafana dashboard to use these new values, which will also give us a good indication of whether there's an issue with the way the new naming is done.
madrob
left a comment
Do we need to consolidate this with AuthenticationPlugin metrics as well?
solr/core/src/java/org/apache/solr/handler/RequestHandlerBase.java
Changes to the Prometheus config are necessary. I want to follow whatever how-to we have on using Grafana to ensure it looks right.
Huh?
* needn't select ".distrib." as this is the default semantic
* remove ".local." additions because these are already expressed via separate request handlers with a suffix
* time_seconds_total computed differently; looks suspicious
Definitely see JIRA for my overall comments. I want to point out that the "totalTime" metric has been a nanosecond number in recent years, yet once upon a time it was milliseconds. This change was very likely inadvertent. Our Prometheus solr-exporter-config.xml treats it as milliseconds. It's not; RequestHandlerBase increments this counter by "elapsed", the response of
I have seen some very strange, huge numbers in Grafana dashboards for Solr, so I think this is a bug. Can you file a new JIRA for it?
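To make the unit mismatch described above concrete, here is a minimal Java sketch. It is not Solr code: the class and method names are hypothetical, and it only assumes what the comment states, namely that the counter holds nanoseconds while the consumer divides as if it held milliseconds.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; none of these names exist in Solr.
final class TotalTimeUnits {

    /** Correct conversion for a counter that accumulates nanoseconds. */
    static double nanosToSeconds(long nanos) {
        return nanos / 1_000_000_000.0;
    }

    /** The mistaken conversion: the same raw value read as milliseconds. */
    static double millisToSeconds(long millis) {
        return millis / 1_000.0;
    }

    public static void main(String[] args) {
        // Assume one request took 250 ms and was recorded in nanoseconds.
        long recorded = TimeUnit.MILLISECONDS.toNanos(250);

        System.out.println("read as nanoseconds:  " + nanosToSeconds(recorded) + " s");
        System.out.println("read as milliseconds: " + millisToSeconds(recorded) + " s");
    }
}
```

Reading the nanosecond value as milliseconds inflates the result by a factor of one million (0.25 s becomes 250000 s), which would be consistent with the huge dashboard numbers mentioned above.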
…shard" or "false". Updated Grafana to use this to match former logic.
Ready for review again. Pending:
# Conflicts:
#   solr/core/src/java/org/apache/solr/handler/RequestHandlerBase.java
#   solr/core/src/java/org/apache/solr/handler/component/SearchHandler.java
#   solr/core/src/java/org/apache/solr/util/SolrPluginUtils.java
I think it's finally ready. Please check out the ref guide explanation that I just pushed. I think the outdated MetricsQueryTemplateTest ought to be replaced with a very different integration test that is more of a sanity check that certain metrics we expect to see are present. Such a test could be done via Docker, and it would thus offer more and better test coverage showing that other aspects are working too.
HoustonPutman
left a comment
This looks good to me, but I haven't gone through and tested it yet. I might get time to do that tomorrow.
    main = project.ext.mainClass
    classpath = sourceSets.main.runtimeClasspath
    systemProperties = ["log4j.configurationFile":"file:conf/log4j2.xml"]
    args = ["-f", "conf/solr-exporter-config.xml"]
Ahh we could also add this to the classpath above like we do in the bin script.
Though it's not something we need to do in this PR.
Ok found an issue, the
It was an easy fix, adding the same code from the I'll go ahead and push the fix and let you take a look @dsmiley. Otherwise I'm good with this. Feel free to revert and fix in another way if you prefer.
Thanks Houston! Yeah, I wondered about the other templates, but I didn't notice that they would process such cases.
* only do for SearchHandler, not all request handlers (less metrics overall)
* track all the same details at the shard level as request (more detailed metrics)
* use [shard] suffix; do away with .distrib. and .local.
* don't limit this to SolrCloud

Prometheus Exporter & Grafana config:
* remove select ".distrib."; this is the default semantic
* remove ".local." additions because these are already expressed via separate request handlers with a suffix
* time_seconds_total computed differently; looks suspicious
* extract an "internal" Prometheus label from the handler; has values "shard" or "false". Updated Grafana to use this to match former logic.

Misc:
* prometheus gradle: fix "run" task
* fix README link

Co-authored-by: Houston Putman <houston@apache.org>
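As a rough illustration of the "[shard] suffix" and "internal" label items in the commit message above, the following Java sketch splits a handler name into a base handler and an "internal" value of "shard" or "false". It is an assumption-laden sketch, not the exporter's implementation: the class and method names are hypothetical, the real extraction lives in the Prometheus exporter config rather than in Java, and the exact key format (for example "/select[shard]") is assumed from the suffix described in the commit message.

```java
// Hypothetical helper; the actual label extraction is done in the exporter config.
final class ShardLabel {

    // Suffix named in the commit message; assumed to be appended to the handler name.
    static final String SUFFIX = "[shard]";

    /** Returns the handler name without the "[shard]" suffix, if present. */
    static String baseHandler(String handler) {
        return handler.endsWith(SUFFIX)
            ? handler.substring(0, handler.length() - SUFFIX.length())
            : handler;
    }

    /** Returns "shard" for shard-level (internal) requests, otherwise "false". */
    static String internalLabel(String handler) {
        return handler.endsWith(SUFFIX) ? "shard" : "false";
    }

    public static void main(String[] args) {
        for (String h : new String[] {"/select", "/select[shard]"}) {
            System.out.println(h + " -> handler=" + baseHandler(h)
                + ", internal=" + internalLabel(h));
        }
    }
}
```

With a label like this, a dashboard can filter on internal="false" to see only top-level requests, which appears to be what the former ".distrib." selection expressed.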
https://issues.apache.org/jira/browse/SOLR-14401
(I have some comments I'll post later)
(I have yet to update the affected tests; there seem to be 1-2)