Skip to content

Comments

SOLR-14401: Track distrib/shard metrics differently#657

Merged
dsmiley merged 19 commits intoapache:mainfrom
dsmiley:solr-14401-distribMetrics
Mar 10, 2022
Merged

SOLR-14401: Track distrib/shard metrics differently#657
dsmiley merged 19 commits intoapache:mainfrom
dsmiley:solr-14401-distribMetrics

Conversation

@dsmiley
Copy link
Contributor

@dsmiley dsmiley commented Feb 18, 2022

  • only do for SearchHandler, not all request handlers
  • track all the same details at the shard level as request (more metrics)
  • don't limit this to SolrCloud

https://issues.apache.org/jira/browse/SOLR-14401

(I have some comments I'll post later)
(I have yet to updated affected tests; seems to be 1-2)

* only do for SearchHandler, not all request handlers
* track all the same details at the shard level as request (more metrics)
* don't limit this to SolrCloud
Copy link
Contributor

@HoustonPutman HoustonPutman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look good to me. I don't have any particular opinion on the new naming, I see pros and cons, but if it works, it works.

Obviously we need some documentation around this, and I'm assuming that's coming next.

I think we need to change the default prometheus exporter config & graphana dashboard to use these new values, which will also give us a good indication if theres an issue with the way that the new naming is done.

Copy link
Contributor

@madrob madrob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to consolidate this with AuthenticationPlugin metrics as well?

@dsmiley
Copy link
Contributor Author

dsmiley commented Feb 23, 2022

Changes to the prometheus config are necessary. I want to follow some how-to on use of Grafana we have to ensure it looks right.
Also, performance-statistics-reference.adoc

@dsmiley
Copy link
Contributor Author

dsmiley commented Feb 23, 2022

Do we need to consolidate this with AuthenticationPlugin metrics as well?

Huh?

dsmiley added 5 commits March 3, 2022 08:25
* needn't select ".distrib." as this is the default semantic
* remove ".local." additions because these are already expressed via separate request handlers with a suffix
* time_seconds_total computed differently; looks suspicious
@dsmiley
Copy link
Contributor Author

dsmiley commented Mar 3, 2022

Definitely see JIRA for my overall comments.

I want to point out that I observed that the "totalTime" metric has been a nanosecond number in recent years, yet once upon a time it was milliseconds. This change was very likely inadvertent. Our prometheus solr-exporter-config.xml shows that it thinks it's milliseconds. It's not; RequestHandlerBase increments this counter by "elapsed", the response of timer.stop() -- nanoseconds. Years ago it had invoked MetricUtils.nsToMs( but it appears @sigram removed this as a part of other changes in 2017 sometime -- d8df9f8

@janhoy
Copy link
Contributor

janhoy commented Mar 3, 2022

Definitely see JIRA for my overall comments.

I want to point out that I observed that the "totalTime" metric has been a nanosecond number in recent years, yet once upon a time it was milliseconds. This change was very likely inadvertent. Our prometheus solr-exporter-config.xml shows that it thinks it's milliseconds. It's not; RequestHandlerBase increments this counter by "elapsed", the response of timer.stop() -- nanoseconds. Years ago it had invoked MetricUtils.nsToMs( but it appears @sigram removed this as a part of other changes in 2017 sometime -- d8df9f8

I have seen some very strange number in grafana dashboards for Solr with huge numbers. So I think this is a bug. Can you file a new JIRA for it?

@dsmiley
Copy link
Contributor Author

dsmiley commented Mar 3, 2022

@dsmiley
Copy link
Contributor Author

dsmiley commented Mar 4, 2022

Ready for review again. Pending:

  • updated CHANGES.txt, major-changes-in-solr-9.adoc
  • precommit fix / merge with main
  • updating the Solr ref guide

dsmiley added 4 commits March 4, 2022 18:19
# Conflicts:
#	solr/core/src/java/org/apache/solr/handler/RequestHandlerBase.java
#	solr/core/src/java/org/apache/solr/handler/component/SearchHandler.java
#	solr/core/src/java/org/apache/solr/util/SolrPluginUtils.java
@dsmiley
Copy link
Contributor Author

dsmiley commented Mar 8, 2022

I think it's ready finally. Please check out the ref guide explanation that I just pushed.

I think the outdated MetricsQueryTemplateTest ought to be replaced with a very different integration test that is more of a sanity check that certain metrics we expect to see are present. Such a test could be done via docker and it'd thus offer more/better test coverage that other aspects are working too.

Copy link
Contributor

@HoustonPutman HoustonPutman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, but I haven't gone through and tested it yet. Might get time to do that tomorrow

main = project.ext.mainClass
classpath = sourceSets.main.runtimeClasspath
systemProperties = ["log4j.configurationFile":"file:conf/log4j2.xml"]
args = ["-f", "conf/solr-exporter-config.xml"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh we could also add this to the classpath above like we do in the bin script.

Though it's not something we need to do in this PR.

@HoustonPutman
Copy link
Contributor

HoustonPutman commented Mar 10, 2022

Ok found an issue, the core-query template isn't the only affected template for prometheus. core has metrics that contain [shard] as well, including (but not limited to):

  • solr_metrics_core_errors_total
  • solr_metrics_core_requests_total
  • solr_metrics_core_timeouts_total

It was an easy fix, adding the same code from the core-query template, but it also means that the internal=false tag will be added for other handlers that may be used internally, we just don't include the tag for it. Personally I think that's fine and it can be seen as a possible improvement for the future. (I don't think the graphana dashboard is necessarily impacted by this. We could certainly add the tag filter for these metrics, but I don't think its necessary for this PR)

I'll go ahead and push the fix and let you take a look @dsmiley. Otherwise I'm good with this. Feel free to revert and fix in another way if you prefer.

@dsmiley dsmiley merged commit a8ae853 into apache:main Mar 10, 2022
@dsmiley dsmiley deleted the solr-14401-distribMetrics branch March 10, 2022 22:41
@dsmiley
Copy link
Contributor Author

dsmiley commented Mar 10, 2022

Thanks Houston! Yeah I wondered about the other templates but I didn't notice that they wold process such cases.

dsmiley added a commit that referenced this pull request Mar 12, 2022
* only do for SearchHandler, not all request handlers (less metrics overall)
* track all the same details at the shard level as request (more detailed metrics)
* use [shard] suffix; do away with .distrib. and .local.
* don't limit this to SolrCloud

Prometheus Exporter & Grafana config:
* remove select ".distrib."; this is the default semantic
* remove ".local." additions because these are already expressed via separate request handlers with a suffix
* time_seconds_total computed differently; looks suspicious
* extract an "internal" Prometheus label from the handler; has values "shard" or "false".  Updated Grafana to use this to match former logic.

Misc:
* prometheus gradle: fix "run" task
* fix README link

Co-authored-by: Houston Putman <houston@apache.org>
dsmiley added a commit that referenced this pull request Mar 12, 2022
* only do for SearchHandler, not all request handlers (less metrics overall)
* track all the same details at the shard level as request (more detailed metrics)
* use [shard] suffix; do away with .distrib. and .local.
* don't limit this to SolrCloud

Prometheus Exporter & Grafana config:
* remove select ".distrib."; this is the default semantic
* remove ".local." additions because these are already expressed via separate request handlers with a suffix
* time_seconds_total computed differently; looks suspicious
* extract an "internal" Prometheus label from the handler; has values "shard" or "false".  Updated Grafana to use this to match former logic.

Misc:
* prometheus gradle: fix "run" task
* fix README link

Co-authored-by: Houston Putman <houston@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants