sampling: fix pubsub implementation #5126
💚 Build Succeeded
589df06 to c670d64
c670d64 to bf62e28
The initial implementation was written as a ~quick hack, with the expectation that it would be replaced by the Changes API.

It was broken due to its ignorance of data streams, and multi-shard indices. Sequence numbers are only comparable within a single shard.

Given that there is no known delivery date for the Changes API, we propose to instead revise the pubsub implementation to address the problems by:
- enforcing single-shard indices for sampled trace data streams
- searching (now single-shard) backing indices individually

In addition, we now use global checkpoints to bound searches, and use PIT (point in time) for paging through results. Querying underlying indices and global checkpoints requires an additional "monitor" index privilege.
bf62e28 to 845e77e
Simplify by just using direct searches with a range on _seq_no, using the most recently observed _seq_no value as the lower bound. We can do this within the loop as well (i.e. until there are no more results, or we've observed the global checkpoint).
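As a rough illustration of the direct search described above, the following sketch builds a search request body with a range on `_seq_no`. The function name `seqNoSearchBody` and its parameters are hypothetical and not taken from this PR; the upper bound on the global checkpoint reflects the bounding mentioned in the first commit and is an assumption about how the two are combined.

```go
package main

import "encoding/json"

// seqNoSearchBody is a hypothetical helper (not from the PR). It builds a
// search body selecting documents with lastSeqNo < _seq_no <= globalCheckpoint,
// sorted by ascending _seq_no so the last hit gives the next lower bound.
func seqNoSearchBody(lastSeqNo, globalCheckpoint int64, size int) ([]byte, error) {
	body := map[string]interface{}{
		"size": size,
		// Ask Elasticsearch to include _seq_no and _primary_term in each hit.
		"seq_no_primary_term": true,
		"sort": []interface{}{
			map[string]interface{}{"_seq_no": "asc"},
		},
		"query": map[string]interface{}{
			"range": map[string]interface{}{
				"_seq_no": map[string]interface{}{
					"gt":  lastSeqNo,        // most recently observed _seq_no
					"lte": globalCheckpoint, // assumed upper bound
				},
			},
		},
	}
	return json.Marshal(body)
}
```

Because sequence numbers are only comparable within a single shard, a body like this is only meaningful when sent to one single-shard backing index at a time.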
Refresh indices after observing an updated global checkpoint to ensure document visibility is correct up to the observed global checkpoint.
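The refresh-then-search loop can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the `shardIndex` interface, the `drain` function, and the in-memory `memIndex` fake are all hypothetical names introduced here, and a real implementation would issue Elasticsearch requests instead.

```go
package main

// doc is a hypothetical search hit carrying its sequence number.
type doc struct {
	SeqNo int64
	ID    string
}

// shardIndex is a hypothetical abstraction over one single-shard backing index.
type shardIndex interface {
	GlobalCheckpoint() (int64, error)
	Refresh() error
	// Search returns up to size docs with after < SeqNo <= max, ascending.
	Search(after, max int64, size int) ([]doc, error)
}

// drain handles all documents newer than lastSeqNo, up to the observed global
// checkpoint, and returns the last sequence number it handled.
func drain(idx shardIndex, lastSeqNo int64, size int, handle func(doc)) (int64, error) {
	gcp, err := idx.GlobalCheckpoint()
	if err != nil {
		return lastSeqNo, err
	}
	if gcp <= lastSeqNo {
		return lastSeqNo, nil // nothing new; skip the refresh
	}
	// Refresh so documents up to the observed checkpoint are searchable.
	if err := idx.Refresh(); err != nil {
		return lastSeqNo, err
	}
	// Loop until there are no more results or the checkpoint is reached.
	for lastSeqNo < gcp {
		docs, err := idx.Search(lastSeqNo, gcp, size)
		if err != nil {
			return lastSeqNo, err
		}
		if len(docs) == 0 {
			break
		}
		for _, d := range docs {
			handle(d)
			lastSeqNo = d.SeqNo
		}
	}
	return lastSeqNo, nil
}

// memIndex is an in-memory fake used only to exercise drain.
type memIndex struct {
	docs       []doc // sorted by SeqNo
	checkpoint int64
	refreshes  int
}

func (m *memIndex) GlobalCheckpoint() (int64, error) { return m.checkpoint, nil }

func (m *memIndex) Refresh() error { m.refreshes++; return nil }

func (m *memIndex) Search(after, max int64, size int) ([]doc, error) {
	var out []doc
	for _, d := range m.docs {
		if d.SeqNo > after && d.SeqNo <= max && len(out) < size {
			out = append(out, d)
		}
	}
	return out, nil
}
```

In this sketch the refresh is skipped when the global checkpoint has not advanced, so an idle index costs only the checkpoint lookup per poll.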
This pull request is now in conflicts. Could you fix it @axw? 🙏
Everything looks reasonable, but this is far enough outside my area of knowledge that I can't give a definitive 👍 / 👎. I'd be happy to listen to a quick walkthrough of the code, or to merge and deal with any issues that arise, since the current implementation is broken.
* sampling: fix pubsub implementation

  The initial implementation was written as a ~quick hack, with the expectation that it would be replaced by the Changes API.

  It was broken due to its ignorance of data streams, and multi-shard indices. Sequence numbers are only comparable within a single shard.

  Given that there is no known delivery date for the Changes API, we propose to instead revise the pubsub implementation to address the problems by:
  - enforcing single-shard indices for sampled trace data streams
  - searching (now single-shard) backing indices individually

  In addition, we now use global checkpoints to bound searches, and use PIT (point in time) for paging through results. Querying underlying indices and global checkpoints requires an additional "monitor" index privilege.

* sampling/pubsub: remove PIT again

  Simplify by just using direct searches with a range on _seq_no, using the most recently observed _seq_no value as the lower bound. We can do this within the loop as well (i.e. until there are no more results, or we've observed the global checkpoint).

* sampling/pubsub: only query get metric from _stats

* pubsub: force-refresh indices

  Refresh indices after observing an updated global checkpoint to ensure document visibility is correct up to the observed global checkpoint.

* Update changelog

* systemtest: fix spurious test failure

(cherry picked from commit 94e3201)

# Conflicts:
# changelogs/head.asciidoc
* sampling: fix pubsub implementation (#5126)

* sampling: fix pubsub implementation

  The initial implementation was written as a ~quick hack, with the expectation that it would be replaced by the Changes API.

  It was broken due to its ignorance of data streams, and multi-shard indices. Sequence numbers are only comparable within a single shard.

  Given that there is no known delivery date for the Changes API, we propose to instead revise the pubsub implementation to address the problems by:
  - enforcing single-shard indices for sampled trace data streams
  - searching (now single-shard) backing indices individually

  In addition, we now use global checkpoints to bound searches, and use PIT (point in time) for paging through results. Querying underlying indices and global checkpoints requires an additional "monitor" index privilege.

* sampling/pubsub: remove PIT again

  Simplify by just using direct searches with a range on _seq_no, using the most recently observed _seq_no value as the lower bound. We can do this within the loop as well (i.e. until there are no more results, or we've observed the global checkpoint).

* sampling/pubsub: only query get metric from _stats

* pubsub: force-refresh indices

  Refresh indices after observing an updated global checkpoint to ensure document visibility is correct up to the observed global checkpoint.

* Update changelog

* systemtest: fix spurious test failure

(cherry picked from commit 94e3201)

# Conflicts:
# changelogs/head.asciidoc

* Delete head.asciidoc

Co-authored-by: Andrew Wilkins <axw@elastic.co>
* sampling: fix pubsub implementation

  The initial implementation was written as a ~quick hack, with the expectation that it would be replaced by the Changes API.

  It was broken due to its ignorance of data streams, and multi-shard indices. Sequence numbers are only comparable within a single shard.

  Given that there is no known delivery date for the Changes API, we propose to instead revise the pubsub implementation to address the problems by:
  - enforcing single-shard indices for sampled trace data streams
  - searching (now single-shard) backing indices individually

  In addition, we now use global checkpoints to bound searches, and use PIT (point in time) for paging through results. Querying underlying indices and global checkpoints requires an additional "monitor" index privilege.

* sampling/pubsub: remove PIT again

  Simplify by just using direct searches with a range on _seq_no, using the most recently observed _seq_no value as the lower bound. We can do this within the loop as well (i.e. until there are no more results, or we've observed the global checkpoint).

* sampling/pubsub: only query get metric from _stats

* pubsub: force-refresh indices

  Refresh indices after observing an updated global checkpoint to ensure document visibility is correct up to the observed global checkpoint.

* Update changelog

* systemtest: fix spurious test failure

(cherry picked from commit 94e3201)

# Conflicts:
# apmpackage/apm/0.2.0/data_stream/sampled_traces/manifest.yml
# changelogs/head.asciidoc
# x-pack/apm-server/sampling/pubsub/pubsub.go
# x-pack/apm-server/sampling/pubsub/pubsub_test.go
# x-pack/apm-server/sampling/pubsub/pubsubtest/client.go
confirmed with SNAPSHOT
Motivation/summary
The initial implementation was written as a ~quick hack, with the expectation that it would be replaced by the Changes API.
It was broken due to its ignorance of data streams, and multi-shard indices. Sequence numbers are only comparable within a single shard.
Instead of waiting for the Changes API we could take the same approach as elastic/fleet-server#200, using
a new Elasticsearch API built for Fleet Server. The main issue here is that the API does not (currently) support data streams.
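Given the above and that there is no known delivery date for the Changes API, we propose to instead revise the pubsub implementation to address the problems by:
- enforcing single-shard indices for sampled trace data streams
- searching (now single-shard) backing indices individually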
In addition, we now use global checkpoints to bound searches (ensuring all replicas have committed the documents). Querying underlying indices and global checkpoints will require an additional "monitor" index privilege.
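To illustrate how a global checkpoint might be read per backing index, here is a small sketch that decodes a shard-level index stats response. It assumes the response shape `indices.<name>.shards.<n>[].seq_no.global_checkpoint` returned by the index stats API with `level=shards`; the type `indicesStats` and the function `globalCheckpoints` are hypothetical names, and rejecting multi-shard indices here is only one possible way of relying on the single-shard constraint.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// indicesStats models only the fields this sketch needs from a shard-level
// index stats response (assumed shape, see the lead-in).
type indicesStats struct {
	Indices map[string]struct {
		Shards map[string][]struct {
			Routing struct {
				Primary bool `json:"primary"`
			} `json:"routing"`
			SeqNo struct {
				GlobalCheckpoint int64 `json:"global_checkpoint"`
			} `json:"seq_no"`
		} `json:"shards"`
	} `json:"indices"`
}

// globalCheckpoints returns the primary's global checkpoint for each index,
// and fails if any index has more than one shard.
func globalCheckpoints(body []byte) (map[string]int64, error) {
	var stats indicesStats
	if err := json.Unmarshal(body, &stats); err != nil {
		return nil, err
	}
	out := make(map[string]int64)
	for name, idx := range stats.Indices {
		if len(idx.Shards) != 1 {
			return nil, fmt.Errorf("index %q has %d shards, expected 1", name, len(idx.Shards))
		}
		for _, copies := range idx.Shards {
			for _, c := range copies {
				if c.Routing.Primary {
					out[name] = c.SeqNo.GlobalCheckpoint
				}
			}
		}
	}
	return out, nil
}
```

The returned map can then serve as the per-index upper bound for `_seq_no` range searches.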
Checklist
- [ ] Documentation has been updated
How to test these changes
Related issues
Closes #5119