sampling: fix pubsub implementation #5126

axw · 2021-04-20T07:43:29Z

Motivation/summary

The initial implementation was written as a ~quick hack, with the expectation that it would be replaced by the Changes API.
It was broken because it ignored data streams and multi-shard indices: sequence numbers are only comparable within a single shard.

Instead of waiting for the Changes API we could take the same approach as elastic/fleet-server#200, using a new Elasticsearch API built for Fleet Server. The main issue here is that the API does not (currently) support data streams.

Given the above and that there is no known delivery date for the Changes API, we propose to instead revise the pubsub implementation to address the problems by:

- enforcing single-shard indices for sampled trace data streams
- searching (now single-shard) backing indices individually

In addition, we now use global checkpoints to bound searches (ensuring all replicas have committed the documents). Querying underlying indices and global checkpoints will require an additional "monitor" index privilege.
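As a rough illustration of reading global checkpoints per backing index, here is a minimal sketch in Go. It assumes the checkpoints come from an index stats (`_stats`) response requested at shard level, and that each shard copy reports `seq_no.global_checkpoint`; the type and function names (`statsResponse`, `globalCheckpoints`) and the index name in the sample are hypothetical, not taken from the PR's code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// statsResponse models only the part of a shard-level index stats
// response that this sketch needs (assumed response shape).
type statsResponse struct {
	Indices map[string]struct {
		Shards map[string][]struct {
			SeqNo struct {
				GlobalCheckpoint int64 `json:"global_checkpoint"`
			} `json:"seq_no"`
		} `json:"shards"`
	} `json:"indices"`
}

// globalCheckpoints returns one global checkpoint per backing index.
// It relies on every index having exactly one shard and returns an
// error otherwise, because sequence numbers from different shards
// cannot be compared.
func globalCheckpoints(body []byte) (map[string]int64, error) {
	var resp statsResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	checkpoints := make(map[string]int64, len(resp.Indices))
	for name, index := range resp.Indices {
		if len(index.Shards) != 1 {
			return nil, fmt.Errorf("index %q: expected 1 shard, got %d", name, len(index.Shards))
		}
		for _, copies := range index.Shards {
			if len(copies) == 0 {
				return nil, fmt.Errorf("index %q: no shard copies reported", name)
			}
			// Assumption: a copy's view of the global checkpoint may lag
			// behind the primary's, so keep the highest value reported.
			highest := copies[0].SeqNo.GlobalCheckpoint
			for _, c := range copies[1:] {
				if c.SeqNo.GlobalCheckpoint > highest {
					highest = c.SeqNo.GlobalCheckpoint
				}
			}
			checkpoints[name] = highest
		}
	}
	return checkpoints, nil
}

func main() {
	// Abridged, hand-written sample; a real response has many more fields.
	sample := []byte(`{"indices":{"idx-000001":{"shards":{"0":[
		{"seq_no":{"global_checkpoint":41}},
		{"seq_no":{"global_checkpoint":42}}]}}}}`)
	checkpoints, err := globalCheckpoints(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(checkpoints)
}
```

Rejecting multi-shard indices outright, rather than picking one shard, keeps the sketch in line with the single-shard requirement described above.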

Checklist

Update CHANGELOG.asciidoc
~~- [ ] Documentation has been updated~~

How to test these changes

1. Run two APM Servers with tail-based sampling enabled
2. Create some transactions
3. Force a rollover of the sampled traces data stream, to create a new index
4. Create some more transactions
5. Check transactions are indexed according to the sampling config
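For the rollover step, one option is to call the Elasticsearch rollover API directly. The sketch below only builds the request; the base URL and the data stream name are placeholders (assumptions), so substitute the actual sampled traces data stream in your deployment. `newRolloverRequest` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// newRolloverRequest builds a POST /<target>/_rollover request.
// With no request body, the rollover is unconditional.
func newRolloverRequest(baseURL, dataStream string) (*http.Request, error) {
	u := baseURL + "/" + url.PathEscape(dataStream) + "/_rollover"
	return http.NewRequest(http.MethodPost, u, nil)
}

func main() {
	// Both arguments are placeholders for illustration only.
	req, err := newRolloverRequest("http://localhost:9200", "my-sampled-traces-stream")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
	// To actually send it: http.DefaultClient.Do(req), plus authentication.
}
```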

Related issues

Closes #5119

apmmachine · 2021-04-20T07:50:15Z

💚 Build Succeeded

Build stats

Build Cause: Pull request #5126 updated
Start Time: 2021-05-26T04:39:18.067+0000
Duration: 39 min 24 sec
Commit: 90526c6

Test stats 🧪

Test	Results
Failed	0
Passed	6262
Skipped	120
Total	6382

mergify · 2021-05-18T18:08:36Z

This pull request is now in conflicts. Could you fix it @axw? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b sampling-pubsub-checkpoints upstream/sampling-pubsub-checkpoints
git merge upstream/master
git push upstream sampling-pubsub-checkpoints

stuartnelson3

Everything looks reasonable, but this is far enough outside my area of knowledge that I can't give a definitive 👍 / 👎. I'd be happy to listen to a quick walkthrough of the code, or to merge and deal with any issues that arise, since the current implementation is broken.

mergify · 2021-05-25T09:18:16Z

This pull request is now in conflicts. Could you fix it @axw? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b sampling-pubsub-checkpoints upstream/sampling-pubsub-checkpoints
git merge upstream/master
git push upstream sampling-pubsub-checkpoints

* sampling: fix pubsub implementation

  The initial implementation was written as a ~quick hack, with the expectation that it would be replaced by the Changes API. It was broken due to its ignorance of data streams and multi-shard indices. Sequence numbers are only comparable within a single shard.

  Given that there is no known delivery date for the Changes API, we propose to instead revise the pubsub implementation to address the problems by:

  - enforcing single-shard indices for sampled trace data streams
  - searching (now single-shard) backing indices individually

  In addition, we now use global checkpoints to bound searches, and use PIT (point in time) for paging through results. Querying underlying indices and global checkpoints requires an additional "monitor" index privilege.

* sampling/pubsub: remove PIT again

  Simplify by just using direct searches with a range on _seq_no, using the most recently observed _seq_no value as the lower bound. We can do this within the loop as well (i.e. until there are no more results, or we've observed the global checkpoint).

* sampling/pubsub: only query get metric from _stats

* pubsub: force-refresh indices

  Refresh indices after observing an updated global checkpoint to ensure document visibility is correct up to the observed global checkpoint.

* Update changelog

* systemtest: fix spurious test failure

(cherry picked from commit 94e3201)

# Conflicts:
# changelogs/head.asciidoc

* sampling: fix pubsub implementation (#5126)

(cherry picked from commit 94e3201)

# Conflicts:
# changelogs/head.asciidoc

* Delete head.asciidoc

Co-authored-by: Andrew Wilkins <axw@elastic.co>

* sampling: fix pubsub implementation

(cherry picked from commit 94e3201)

# Conflicts:
# apmpackage/apm/0.2.0/data_stream/sampled_traces/manifest.yml
# changelogs/head.asciidoc
# x-pack/apm-server/sampling/pubsub/pubsub.go
# x-pack/apm-server/sampling/pubsub/pubsub_test.go
# x-pack/apm-server/sampling/pubsub/pubsubtest/client.go

stuartnelson3 · 2021-07-19T15:17:46Z

confirmed with SNAPSHOT

axw force-pushed the sampling-pubsub-checkpoints branch from 589df06 to c670d64 Compare April 20, 2021 09:47

axw added the v7.14.0 label Apr 20, 2021

axw force-pushed the sampling-pubsub-checkpoints branch from c670d64 to bf62e28 Compare April 20, 2021 11:52

axw force-pushed the sampling-pubsub-checkpoints branch from bf62e28 to 845e77e Compare April 20, 2021 11:55

axw added 2 commits April 22, 2021 13:39

Merge branch 'master' into sampling-pubsub-checkpoints

c7445fc

sampling/pubsub: remove PIT again

c62d7d8

Simplify by just using direct searches with a range on _seq_no, using the most recently observed _seq_no value as the lower bound. We can do this within the loop as well (i.e. until there are no more results, or we've observed the global checkpoint).

henningandersen reviewed Apr 26, 2021

x-pack/apm-server/sampling/pubsub/checkpoints.go

henningandersen reviewed Apr 26, 2021

x-pack/apm-server/sampling/pubsub/pubsub.go (outdated)

axw added 2 commits April 27, 2021 12:12

sampling/pubsub: only query get metric from _stats

a21ac00

pubsub: force-refresh indices

3f2b243

Refresh indices after observing an updated global checkpoint to ensure document visibility is correct up to the observed global checkpoint.
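One way to picture the refresh step: compare the newly observed global checkpoints with the previous ones, and refresh only the indices that advanced before searching them. This is a sketch under assumptions, not the PR's code; `advancedIndices` and `refreshPath` are hypothetical helpers, the index names are made up, and treating an unseen index as having checkpoint -1 is an assumption.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// advancedIndices returns, in sorted order, the indices whose global
// checkpoint in cur is higher than in prev. An index missing from prev
// is treated as having checkpoint -1 (no operations observed yet).
func advancedIndices(prev, cur map[string]int64) []string {
	var out []string
	for name, checkpoint := range cur {
		old, ok := prev[name]
		if !ok {
			old = -1
		}
		if checkpoint > old {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

// refreshPath builds the path for a refresh request covering the given
// indices, i.e. POST /<index1>,<index2>/_refresh.
func refreshPath(indices []string) string {
	return "/" + strings.Join(indices, ",") + "/_refresh"
}

func main() {
	prev := map[string]int64{"idx-000001": 5, "idx-000002": 7}
	cur := map[string]int64{"idx-000001": 5, "idx-000002": 9, "idx-000003": 0}
	changed := advancedIndices(prev, cur)
	if len(changed) > 0 {
		// Refresh first, so documents up to the observed checkpoint are searchable.
		fmt.Println("POST", refreshPath(changed))
	}
}
```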

axw marked this pull request as ready for review April 27, 2021 07:38

axw requested a review from a team April 27, 2021 07:38

axw added 6 commits April 27, 2021 15:39

Merge branch 'master' into sampling-pubsub-checkpoints

d681d4b

Update changelog

b86a659

Merge branch 'master' into sampling-pubsub-checkpoints

f28e7db

Merge branch 'master' into sampling-pubsub-checkpoints

a1f6ad5

Merge branch 'master' into sampling-pubsub-checkpoints

ef6e98c

Merge branch 'master' into sampling-pubsub-checkpoints

7f87e18

Merge branch 'master' into sampling-pubsub-checkpoints

5d39a39

stuartnelson3 reviewed May 25, 2021

simitt reviewed May 25, 2021

x-pack/apm-server/sampling/pubsub/checkpoints.go

axw added 2 commits May 26, 2021 11:31

Merge branch 'master' into sampling-pubsub-checkpoints

754a483

systemtest: fix spurious test failure

90526c6

simitt approved these changes May 26, 2021

axw merged commit 94e3201 into elastic:master May 26, 2021

axw deleted the sampling-pubsub-checkpoints branch May 26, 2021 09:21

mergify bot mentioned this pull request May 26, 2021

[7.x] sampling: fix pubsub implementation (backport #5126) #5349

Merged

stuartnelson3 added the test-plan label Jun 29, 2021

stuartnelson3 self-assigned this Jul 8, 2021

mergify bot mentioned this pull request Jul 8, 2021

[7.14] sampling: fix pubsub implementation (backport #5126) #5640

Closed

stuartnelson3 removed their assignment Jul 9, 2021

stuartnelson3 added the test-plan-ok label Jul 19, 2021
