Track Lucene field usage #74227
Conversation
I like how you are checking the flags that are used in TermsEnum#postings rather than wrapping the PostingsEnum, which helps avoid adding a wrapper class over something that is called in tight loops.
Do we need to actually check whether impacts are used? As a Lucene developer, I'd be curious about this, but I don't know if users would be since this isn't actionable (Lucene doesn't allow you to disable indexing impacts).
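The flag-checking approach praised above can be sketched without Lucene on the classpath. The class below is a hypothetical illustration only: the bit-flag constants are stand-ins modeled on Lucene's PostingsEnum flags (the real ones live in org.apache.lucene.index.PostingsEnum), and `Feature`/`featuresFor` are invented names, not the PR's actual API.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch: inspect the flags passed to TermsEnum#postings instead of
// wrapping the returned PostingsEnum, so the hot per-document iteration path
// stays untouched.
public class PostingsFlagTracker {

    // Stand-in bit flags, modeled on (but not identical to) Lucene's PostingsEnum flags.
    static final int FREQS = 1 << 3;
    static final int POSITIONS = 1 << 4;
    static final int OFFSETS = 1 << 5;
    static final int PAYLOADS = 1 << 6;

    enum Feature { TERMS, FREQS, POSITIONS, OFFSETS, PAYLOADS }

    // Called once per postings(...) call; records which data structures the
    // caller asked for. No per-document work happens here.
    static Set<Feature> featuresFor(int flags) {
        Set<Feature> used = EnumSet.of(Feature.TERMS); // requesting postings implies terms access
        if ((flags & FREQS) != 0) used.add(Feature.FREQS);
        if ((flags & POSITIONS) != 0) used.add(Feature.POSITIONS);
        if ((flags & OFFSETS) != 0) used.add(Feature.OFFSETS);
        if ((flags & PAYLOADS) != 0) used.add(Feature.PAYLOADS);
        return used;
    }
}
```

The cost is a handful of bitwise checks per postings() call rather than an extra virtual call per document.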
server/src/main/java/org/elasticsearch/index/search/stats/ShardFieldUsageTracker.java
Good point. I wasn't sure whether those were configurable or not. I've removed them.
Pinging @elastic/es-search (Team:Search)
If I read the PR correctly, it currently returns raw counts, which are impacted by the number of shards. E.g. a 2-shard index would have 2x the values of a single-shard index. Should we automatically divide the counts for one index by the number of shards it has so that numbers across different indices are easier to compare?
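The shard-count bias described here amounts to a one-line normalization; the helper and the counts below are made up for illustration.

```java
// Hypothetical helper: make raw per-index usage counts comparable across
// indices with different shard counts by averaging over shards.
public class ShardNormalizer {
    static double perShardAverage(long rawCount, int numberOfShards) {
        return (double) rawCount / numberOfShards;
    }
}
```

With this, a 1-shard index reporting 100 uses and a 2-shard index reporting 200 uses for the same workload both normalize to 100.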
@@ -615,8 +616,11 @@ private GetResult getFromTranslog(Get get, Translog.Index index, MappingLookup m
    final Engine.Searcher searcher = new Engine.Searcher("realtime_get", ElasticsearchDirectoryReader.wrap(inMemoryReader, shardId),
        config().getSimilarity(), config().getQueryCache(), config().getQueryCachingPolicy(), inMemoryReader);
    final Searcher wrappedSearcher = searcherWrapper.apply(searcher);
    if (wrappedSearcher == searcher) {
        searcher.close();
        // TODO: we always wrap with field usage tracking, so we need to find another way to determine if we can safely do this
@dnhatn would love to have your thoughts on this ^^

I'm wondering if we could have TranslogLeafReader somehow embedded in SingleDocDirectoryReader so that we always apply the wrapper, but then lazily load the InMemoryLeafReader when there is access to more than what the TranslogLeafReader can offer.
I've built a solution for this. I won't push it to this PR though as it is quite a big change to how TranslogLeafReader works. Will make a separate PR tomorrow if tests show all is fine with that solution.
I've opened #74722 for this.
This looks good to me overall.
One question I wonder about is whether the response should return stats per shard instead of merging them. If we were to dogfood this by indexing it into Elasticsearch to track field usage over time, having per-shard information would be better as we could more easily figure out when resets happen due to shard relocation or node shutdowns. In other words, this could be treated as a source of metrics where the index name, shard number, field name and data structure are dimensions, and the number of times it's been used is a counter.
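The "metrics with dimensions" framing suggested above can be sketched as a counter keyed by those four dimensions. `FieldUsageMetrics` and its members are hypothetical names for illustration, not Elasticsearch classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of field usage as dimensioned metrics: index name, shard
// number, field name and data structure together identify one counter.
public class FieldUsageMetrics {

    // The dimension tuple that identifies a counter.
    record Key(String index, int shard, String field, String dataStructure) {}

    private final Map<Key, Long> counters = new HashMap<>();

    void increment(String index, int shard, String field, String dataStructure) {
        counters.merge(new Key(index, shard, field, dataStructure), 1L, Long::sum);
    }

    long count(String index, int shard, String field, String dataStructure) {
        return counters.getOrDefault(new Key(index, shard, field, dataStructure), 0L);
    }
}
```

Keeping the shard number as a dimension (instead of summing over it) is what lets a downstream consumer detect counter resets per shard copy.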
"doc_values" : 1,
"points" : 0,
"norms" : 1,
"term_vectors" : 0
I wonder if we should have an object for all data structures for consistency, e.g.
"all_fields": {
"any": 6,
"inverted_index": {
"terms" : 1,
"postings" : 1,
"proximity" : 1,
"positions" : 0,
"term_frequencies" : 1,
"offsets" : 0,
"payloads" : 0
},
"stored_fields" : {
"any": 2
},
[...]
Good question. In that case, tracking per shard would not even be enough as we have multiple replicas for which we are already summing up the stats (those can come and go as well). It would have to be tracked per some kind of "shard copy". We suffer here from the same issue as other existing stats, e.g.,
Reading from translog during a realtime get requires special handling in some higher level components, e.g. ShardGetService, where we're doing a bunch of tricks to extract other stored fields from the source. Another issue with the current approach relates to #74227, where we introduce a new "field usage tracking" directory wrapper that's always applied, and we want to make sure that we can still quickly do realtime gets from translog without creating an in-memory index of the document, even when this directory wrapper exists. This PR introduces a directory reader that contains a single translog indexing operation. This can be used during a realtime get to access documents that haven't been refreshed yet. In the normal case, all information relevant to resolve the realtime get is mocked out to provide fast access to _id and _source. In cases where more is requested (e.g. access to other stored fields), this reader will index the document into an in-memory Lucene segment that is created on demand. Relates #64504
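The on-demand in-memory segment described above follows a generic lazy-delegation pattern. The sketch below is self-contained and uses invented names; `LazyDelegate` is not the actual SingleDocDirectoryReader machinery, just an illustration of building the expensive delegate only on first access.

```java
import java.util.function.Supplier;

// Hypothetical sketch of lazy delegation: cheap accesses (e.g. _id, _source)
// are served directly; the expensive delegate (the in-memory Lucene segment)
// is built only when something beyond the fast path is requested.
public class LazyDelegate<T> {

    private final Supplier<T> factory; // builds the expensive delegate on demand
    private volatile T delegate;

    LazyDelegate(Supplier<T> factory) {
        this.factory = factory;
    }

    // Double-checked locking so the delegate is created at most once even
    // under concurrent access.
    T get() {
        T d = delegate;
        if (d == null) {
            synchronized (this) {
                d = delegate;
                if (d == null) {
                    delegate = d = factory.get();
                }
            }
        }
        return d;
    }

    boolean initialized() {
        return delegate != null;
    }
}
```

The fast path never touches the factory, so a realtime get that only needs _id and _source pays nothing for the wrapper being always applied.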
My current thinking is to change the API for now to not aggregate the data beyond the level of shard copies. As there is not a good identifier for shard copies (the allocation id of a primary shard, e.g., stays the same after a node restart), I would go with just a UUID for each shard instance, similar to how we have done it in the repository metering API. It would look like the following:
@jpountz this is ready for another review :)
Adds a field usage API that reports shard-level statistics about which Lucene fields have been accessed, and which parts of the Lucene data structures have been accessed. Field usage statistics are automatically captured when queries are running on a cluster. A shard-level search request that accesses a given field, even if multiple times during that request, is counted as a single use.
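The "counted as a single use per shard-level request" semantics can be sketched with a per-request set that is folded into shard-level totals when the request finishes. Class and method names below are hypothetical, not the actual ShardFieldUsageTracker API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: repeated accesses to a field within one shard-level
// search request collapse into a single use in the shard-level totals.
public class PerRequestFieldUsage {

    private final Map<String, Long> totals = new HashMap<>(); // shard-level counters
    private final Set<String> seenThisRequest = new HashSet<>();

    // Called every time a field is accessed while executing one request.
    void onFieldAccess(String field) {
        seenThisRequest.add(field); // duplicates within the request are absorbed here
    }

    // Called when the request completes: each distinct field counts once.
    void finishRequest() {
        for (String field : seenThisRequest) {
            totals.merge(field, 1L, Long::sum);
        }
        seenThisRequest.clear();
    }

    long usageCount(String field) {
        return totals.getOrDefault(field, 0L);
    }
}
```

A real implementation would also track which data structure (postings, doc values, stored fields, ...) was touched, not just the field name.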
Adapts BWC conditions after backporting #74227 to 7.x branch (targeting 7.15).
I've benchmarked this change on the http_logs track and could not see any substantial changes in performance. I think we will only get full validation of this by merging this change and having our nightly benchmarks catch any changes in performance.

Relates #73944