Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sampling: fix pubsub implementation #5126

Merged
merged 14 commits into from May 26, 2021
Merged

Conversation

axw
Copy link
Member

@axw axw commented Apr 20, 2021

Motivation/summary

The initial implementation was written as a ~quick hack, with the expectation that it would be replaced by the Changes API.
It was broken due to its ignorance of data streams, and multi-shard indices. Sequence numbers are only comparable within a single shard.

Instead of waiting for the Changes API we could take the same approach as elastic/fleet-server#200, using
a new Elasticsearch API built for Fleet Server. The main issue here is that the API does not (currently) support data streams.

Given the above and that there is no known delivery date for the Changes API, we propose to instead revise the pubsub implementation to address the problems by:

  • enforcing single-shard indices for sampled trace data streams
  • searching (now single-shard) backing indices individually

In addition, we now use global checkpoints to bound searches (ensuring all replicas have committed the documents). Querying underlying indices and global checkpoints will require an additional "monitor" index privilege.

Checklist

How to test these changes

  1. Run two APM Servers with tail-based sampling enabled
  2. Create some transactions
  3. Force a rollover of the sampled traces data stream, to create a new index
  4. Create some more transactions
  5. Check transactions are indexed according to the sampling config

Related issues

Closes #5119

@apmmachine
Copy link
Collaborator

apmmachine commented Apr 20, 2021

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Build Cause: Pull request #5126 updated

  • Start Time: 2021-05-26T04:39:18.067+0000

  • Duration: 39 min 24 sec

  • Commit: 90526c6

Test stats 🧪

Test Results
Failed 0
Passed 6262
Skipped 120
Total 6382

Trends 🧪

Image of Build Times

Image of Tests

@axw axw force-pushed the sampling-pubsub-checkpoints branch from 589df06 to c670d64 Compare April 20, 2021 09:47
@axw axw added the v7.14.0 label Apr 20, 2021
@axw axw force-pushed the sampling-pubsub-checkpoints branch from c670d64 to bf62e28 Compare April 20, 2021 11:52
The initial implementation was written as a ~quick hack, with the
expectation that it would be replaced by the Changes API. It was
broken due to its ignorance of data streams, and multi-shard indices.
Sequence numbers are only comparable within a single shard.

Given that there is no known delivery date for the Changes API,
we propose to instead revise the pubsub implementation to address
the problems by:

 - enforcing single-shard indices for sampled trace data streams
 - searching (now single-shard) backing indices individually

In addition, we now use global checkpoints to bound searches, and
use PIT (point in time) for paging through results. Querying underlying
indices and global checkpoints requires an additional "monitor" index
privilege.
@axw axw force-pushed the sampling-pubsub-checkpoints branch from bf62e28 to 845e77e Compare April 20, 2021 11:55
axw added 2 commits April 22, 2021 13:39
Simplify by just using direct searches with a rnage on _seq_no,
using the most recently observed _seq_no value as the lower bound.
We can do this within the loop as well (i.e. until there are no
more results, or we've observed the global checkpoint.)
axw added 2 commits April 27, 2021 12:12
Refresh indices after observing an updated global checkpoint
to ensure document visibility is correct up to the observed
global checkpoint.
@axw axw marked this pull request as ready for review April 27, 2021 07:38
@axw axw requested a review from a team April 27, 2021 07:38
@mergify
Copy link
Contributor

mergify bot commented May 18, 2021

This pull request is now in conflicts. Could you fix it @axw? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b sampling-pubsub-checkpoints upstream/sampling-pubsub-checkpoints
git merge upstream/master
git push upstream sampling-pubsub-checkpoints

Copy link
Contributor

@stuartnelson3 stuartnelson3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks reasonable, but this is enough out of my realm of knowledge for me to not be able to provide a definitive 👍 / 👎 . I'd be happy to listen to a quick walkthrough of the code, or merge and deal with any potential issues that arise since the current implementation is broken.

@mergify
Copy link
Contributor

mergify bot commented May 25, 2021

This pull request is now in conflicts. Could you fix it @axw? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b sampling-pubsub-checkpoints upstream/sampling-pubsub-checkpoints
git merge upstream/master
git push upstream sampling-pubsub-checkpoints

@axw axw merged commit 94e3201 into elastic:master May 26, 2021
@axw axw deleted the sampling-pubsub-checkpoints branch May 26, 2021 09:21
mergify bot pushed a commit that referenced this pull request May 26, 2021
* sampling: fix pubsub implementation

The initial implementation was written as a ~quick hack, with the
expectation that it would be replaced by the Changes API. It was
broken due to its ignorance of data streams, and multi-shard indices.
Sequence numbers are only comparable within a single shard.

Given that there is no known delivery date for the Changes API,
we propose to instead revise the pubsub implementation to address
the problems by:

 - enforcing single-shard indices for sampled trace data streams
 - searching (now single-shard) backing indices individually

In addition, we now use global checkpoints to bound searches, and
use PIT (point in time) for paging through results. Querying underlying
indices and global checkpoints requires an additional "monitor" index
privilege.

* sampling/pubsub: remove PIT again

Simplify by just using direct searches with a rnage on _seq_no,
using the most recently observed _seq_no value as the lower bound.
We can do this within the loop as well (i.e. until there are no
more results, or we've observed the global checkpoint.)

* sampling/pubsub: only query get metric from _stats

* pubsub: force-refresh indices

Refresh indices after observing an updated global checkpoint
to ensure document visibility is correct up to the observed
global checkpoint.

* Update changelog

* systemtest: fix spurious test failure

(cherry picked from commit 94e3201)

# Conflicts:
#	changelogs/head.asciidoc
axw added a commit that referenced this pull request May 26, 2021
* sampling: fix pubsub implementation (#5126)

* sampling: fix pubsub implementation

The initial implementation was written as a ~quick hack, with the
expectation that it would be replaced by the Changes API. It was
broken due to its ignorance of data streams, and multi-shard indices.
Sequence numbers are only comparable within a single shard.

Given that there is no known delivery date for the Changes API,
we propose to instead revise the pubsub implementation to address
the problems by:

 - enforcing single-shard indices for sampled trace data streams
 - searching (now single-shard) backing indices individually

In addition, we now use global checkpoints to bound searches, and
use PIT (point in time) for paging through results. Querying underlying
indices and global checkpoints requires an additional "monitor" index
privilege.

* sampling/pubsub: remove PIT again

Simplify by just using direct searches with a rnage on _seq_no,
using the most recently observed _seq_no value as the lower bound.
We can do this within the loop as well (i.e. until there are no
more results, or we've observed the global checkpoint.)

* sampling/pubsub: only query get metric from _stats

* pubsub: force-refresh indices

Refresh indices after observing an updated global checkpoint
to ensure document visibility is correct up to the observed
global checkpoint.

* Update changelog

* systemtest: fix spurious test failure

(cherry picked from commit 94e3201)

# Conflicts:
#	changelogs/head.asciidoc

* Delete head.asciidoc

Co-authored-by: Andrew Wilkins <axw@elastic.co>
@stuartnelson3 stuartnelson3 self-assigned this Jul 8, 2021
mergify bot pushed a commit that referenced this pull request Jul 8, 2021
* sampling: fix pubsub implementation

The initial implementation was written as a ~quick hack, with the
expectation that it would be replaced by the Changes API. It was
broken due to its ignorance of data streams, and multi-shard indices.
Sequence numbers are only comparable within a single shard.

Given that there is no known delivery date for the Changes API,
we propose to instead revise the pubsub implementation to address
the problems by:

 - enforcing single-shard indices for sampled trace data streams
 - searching (now single-shard) backing indices individually

In addition, we now use global checkpoints to bound searches, and
use PIT (point in time) for paging through results. Querying underlying
indices and global checkpoints requires an additional "monitor" index
privilege.

* sampling/pubsub: remove PIT again

Simplify by just using direct searches with a rnage on _seq_no,
using the most recently observed _seq_no value as the lower bound.
We can do this within the loop as well (i.e. until there are no
more results, or we've observed the global checkpoint.)

* sampling/pubsub: only query get metric from _stats

* pubsub: force-refresh indices

Refresh indices after observing an updated global checkpoint
to ensure document visibility is correct up to the observed
global checkpoint.

* Update changelog

* systemtest: fix spurious test failure

(cherry picked from commit 94e3201)

# Conflicts:
#	apmpackage/apm/0.2.0/data_stream/sampled_traces/manifest.yml
#	changelogs/head.asciidoc
#	x-pack/apm-server/sampling/pubsub/pubsub.go
#	x-pack/apm-server/sampling/pubsub/pubsub_test.go
#	x-pack/apm-server/sampling/pubsub/pubsubtest/client.go
@stuartnelson3 stuartnelson3 removed their assignment Jul 9, 2021
@stuartnelson3
Copy link
Contributor

confirmed with SNAPSHOT

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[sampling]: rewrite pubsub implementation to work properly with multiple indices and shards
5 participants