
[APM] Progressive fetching (experimental) #127598

Merged
merged 34 commits into from
Apr 21, 2022

Conversation

dgieselaar
Member

@dgieselaar dgieselaar commented Mar 14, 2022

This implements progressive fetching for the API endpoints used for the service inventory and the trace inventory. Here's how it works:

  • It's off by default. The user can select a low/medium/high sampling rate if they want to turn it on. The lower the sampling rate, the fewer documents are selected for inclusion in the aggregation.
  • If it's enabled, two requests are made to the API endpoint: one with a probability < 1, depending on the sampling rate, and one with a probability of 1 (unsampled). The probability value is passed to the random_sampler aggregation, which now wraps the aggregations we use to calculate the statistics for the service inventory/trace inventory.
  • The data that is available with the highest sampling rate will be shown on screen.
  • Once any data is in, remove the loading state from the tables so the user can interact with the elements.

Closes #126593.
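The two-phase flow described above can be sketched roughly as follows. This is an illustrative TypeScript sketch; the function names and the rate-to-probability mapping are assumptions, not the actual Kibana code:

```typescript
// Illustrative sketch of the two-phase fetch, NOT the actual Kibana code.
// The rate-to-probability mapping below is a made-up example.
type SamplingRate = 'off' | 'low' | 'medium' | 'high';

const probabilities: Record<SamplingRate, number> = {
  off: 1, // unsampled: every document is considered
  low: 0.001,
  medium: 0.01,
  high: 0.1,
};

// Wrap the statistics aggregations in a random_sampler agg when sampling
// is on; with a probability of 1 the wrapper is skipped entirely.
function withRandomSampler(
  aggs: Record<string, unknown>,
  probability: number
): Record<string, unknown> {
  if (probability >= 1) {
    return aggs;
  }
  return {
    sample: {
      random_sampler: { probability },
      aggs,
    },
  };
}
```

Per the PR description, the UI would issue `withRandomSampler(aggs, probabilities[rate])` and `withRandomSampler(aggs, 1)` in parallel and render whichever response has arrived with the highest sampling rate.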

@kibana-ci
Collaborator

kibana-ci commented Mar 14, 2022

💔 Build Failed

Failed CI Steps

Test Failures

  • [job] [logs] Default CI Group #1 / APM API tests basic apm_mappings_only_8.0.0 Services APIs when data is loaded compare error rate value between service inventory, error rate chart, service inventory and transactions apis "before all" hook for "returns same avg error rate value for Transaction-based and Metric-based data"
  • [job] [logs] Default CI Group #10 / APM specs correlations latency correlations space with no features disabled sets the timePicker to return data

Metrics [docs]

Module Count

Fewer modules lead to a faster build time

id before after diff
apm 1150 1151 +1

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
observability 355 356 +1

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
apm 2.8MB 2.8MB +495.0B

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
observability 87.4KB 87.6KB +128.0B
Unknown metric groups

API count

id before after diff
observability 358 359 +1

ESLint disabled in files

id      before  after  diff
apm     15      14     -1
uptime  7       6      -1
total                  -2

ESLint disabled line counts

id      before  after  diff
apm     88      90     +2
uptime  48      42     -6
total                  -4

Total ESLint disabled count

id      before  after  diff
apm     103     104    +1
uptime  55      48     -7
total                  -6

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

Comment on lines +96 to +97
random_sampler: {
  probability,
Member

This should probably be seeded by session. Otherwise, on every refresh, the initial data fetched could be different. Seeding allows for consistency between page refreshes.

Member Author

aye, any idea what that should/can be? can it be something like "apm-app"?

Member

Integer number only. So, an integer hash of a string is ok.
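A hypothetical sketch of what that could look like: derive a 32-bit integer seed from a string such as "apm-app" plus a session identifier. The hash function and the way the seed is attached are assumptions for illustration, not the code that was committed.

```typescript
// Hypothetical sketch: derive a stable 32-bit integer seed from a string,
// e.g. "apm-app" plus a session id, so random_sampler picks the same
// subset of documents across page refreshes.
function stringToSeed(input: string): number {
  let hash = 0;
  for (let i = 0; i < input.length; i++) {
    // multiply-and-add, coerced into the signed 32-bit range with `| 0`
    hash = (hash * 31 + input.charCodeAt(i)) | 0;
  }
  return hash;
}

// The seed would then accompany the probability in the aggregation body:
const randomSamplerAgg = {
  random_sampler: {
    probability: 0.01,
    seed: stringToSeed('apm-app'),
  },
};
```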

Member

@dgieselaar just confirming that for anything that is a visualization, the progressive fetching is seeded for a user's session :)

Member Author

Do you mean that this is handled in Elasticsearch?

Member

@dgieselaar no, it is not. Elasticsearch doesn't know about kibana user's sessions.

If the random_sampler is used for visualizations, it should be seeded. Otherwise a different subset of the data is used on every search call, which would be jarring as the visualization will subtly jump around.

Member Author

Dang, too bad, because I forgot about it. I'll create a follow-up issue. I don't think this is required for an experimental feature though.

Member Author

@@ -38,7 +38,7 @@ import { eventMetadataRouteRepository } from '../event_metadata/route';
import { suggestionsRouteRepository } from '../suggestions/route';
import { agentKeysRouteRepository } from '../agent_keys/route';

-const getTypedGlobalApmServerRouteRepository = () => {
+function getTypedGlobalApmServerRouteRepository() {
Member

I don't know why you changed this but thank you!! :D

Member Author

that makes two of us!

@lizozom
Contributor

lizozom commented Mar 15, 2022

This is a really exciting idea and I think that multiple teams will greatly benefit from this.

If this POC is successful, we could implement this as a search strategy where the randomly sampled response is emitted with isPartial: true and the full one is emitted with isPartial: false.

The search service API already supports emitting multiple responses, and it would be up to the consumer to decide whether they want to render the partial result or wait for the final response to render.
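The suggested shape could look something like this. The types and function names are assumptions for the sketch; the real Kibana search service types differ:

```typescript
// Assumed shapes for a sketch of the suggested search strategy;
// the real Kibana search service types differ.
interface SearchResponse<T> {
  isPartial: boolean;
  data: T;
}

// Emit the randomly sampled response first (partial), then the full one.
// The consumer decides whether to render the partial result or wait.
async function* progressiveSearch<T>(
  sampledFetch: () => Promise<T>,
  unsampledFetch: () => Promise<T>
): AsyncGenerator<SearchResponse<T>> {
  yield { isPartial: true, data: await sampledFetch() };
  yield { isPartial: false, data: await unsampledFetch() };
}
```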

@@ -10,4 +10,5 @@ export const maxSuggestions = 'observability:maxSuggestions';
export const enableComparisonByDefault = 'observability:enableComparisonByDefault';
export const enableInfrastructureView = 'observability:enableInfrastructureView';
export const defaultApmServiceEnvironment = 'observability:apmDefaultServiceEnvironment';
export const enableRandomSampling = 'observability:enableRandomSampling';
Contributor

Might be good to add an apm prefix as discussed before for other settings

@dgieselaar
Member Author

We pulled this from 8.2 after running out of time. We ran into various things, all related to the random_sampler aggregation being a relatively new feature:

  • The Observability test clusters and the ES image used in APM Integration Testing are not easily updated. The former is updated every one or two weeks, but it's a manual process. APM Integration Testing uses images from the unified release build, but that was red for a week during the development of the feature.
  • There were three consecutive ES bugs in the random_sampler aggregation. All were quickly fixed by Ben, but due to the aforementioned issue w/ working with the latest ES builds, it slowed down implementation of the feature on our side.
  • The Kibana QA team recently added a new option for the functional tests, where backwards compatibility for cross-cluster search is enforced. There is currently no way to disable this flag, so we cannot write any API tests for this feature.
  • The implementation unconditionally wrapped everything in a random_sampler agg, assuming it's a no-op. This made all the API tests fail. We could have probably fixed this, but it's a significant code change and given the time until FF, it's better to pull this change entirely.

@dgieselaar dgieselaar added release_note:enhancement Team:APM All issues that need APM UI Team support labels Apr 12, 2022
services: {
  terms: {
    field: SERVICE_NAME,
sampled: {
Contributor

for the sake of consistency can we use either sample or sampled? 🙏

Member Author

done! changed sampled to sample.

@kpatticha
Contributor

this is super exciting 🥳

it might not be in the scope of the current PR but I've noticed that the detailed_statistics request is fired only when both the sampled and unsampled service inventory fetches are completed.

The downside is that we won't see a performance improvement for rendering the sparklines; it might even be a bit slower because it depends on 2 requests.

@dgieselaar do you think is possible to decouple the dependency?

@dgieselaar
Member Author

@dgieselaar do you think is possible to decouple the dependency?

There's a possibility of the service names changing after the unsampled request comes in. I'd like to avoid triggering more than two requests for detailed_statistics.

The downside is that we won't see a performance improvement for rendering the sparklines; it might even be a bit slower because it depends on 2 requests.

There will still be a performance improvement compared to today - we already block on main_statistics before we fetch detailed_statistics, and detailed_statistics will use the same progressive loading technique, so the sampled data for detailed_statistics will show up earlier, and the unsampled data should show up at roughly the same time as today.
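That ordering could be sketched like this. The function names and the probabilities are hypothetical; only the sequencing reflects the comment above:

```typescript
// Hypothetical sketch of the fetch ordering described above.
// main_statistics runs sampled and unsampled in parallel; detailed_statistics
// only starts once the unsampled service names are final, and then uses the
// same progressive (sampled + unsampled) technique itself.
async function loadInventory(
  fetchMainStats: (probability: number) => Promise<string[]>,
  fetchDetailedStats: (
    serviceNames: string[],
    probability: number
  ) => Promise<unknown>
): Promise<unknown[]> {
  const [, serviceNames] = await Promise.all([
    fetchMainStats(0.1), // sampled: rendered as soon as it arrives
    fetchMainStats(1),   // unsampled: yields the final list of service names
  ]);
  return Promise.all([
    fetchDetailedStats(serviceNames, 0.1), // sampled sparklines show up early
    fetchDetailedStats(serviceNames, 1),   // unsampled arrives about as late as today
  ]);
}
```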

progressiveLoadingQuality
);

const sampledFetch = useFetcher(
Contributor

It would be nice to have some explanation about the difference between sampledFetch and unsampledFetch

Member Author

Do you mean about the concept of sampling? IMHO the variable names are pretty descriptive as-is, but yes, they require the reader to understand what sampling is. But explaining that is going to be a long comment here 😄

Contributor

Or at least something that says when we should use useProgressiveFetcher over useFetcher.

@dgieselaar
Member Author

@cauemarcondes when you use "request changes", can you be explicit about the changes you're requesting? IMHO it should be reserved for blockers and I don't really see any.


const unsampledFetch = useFetcher(
  (regularCallApmApi) => {
    return callback(clientWithProbability(regularCallApmApi, 1));
Contributor

Do you think it would be clearer if you use ProgressiveLoadingQuality here instead of 1?

Suggested change:
- return callback(clientWithProbability(regularCallApmApi, 1));
+ return callback(clientWithProbability(regularCallApmApi, ProgressiveLoadingQuality.off));

Contributor

@cauemarcondes cauemarcondes left a comment

LGTM very nice! 👏🏻

@cauemarcondes
Contributor

@cauemarcondes when you use "request changes", can you be explicit about the changes you're requesting? IMHO it should be reserved for blockers and I don't really see any.

Yeah I shouldn't have selected "request changes" my bad.

@kibana-ci
Collaborator

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules lead to a faster build time

id             before  after  diff
apm            1185    1186   +1
observability  390     391    +1
total                         +2

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
observability 366 370 +4

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
apm 2.8MB 2.8MB +640.0B

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
observability 91.4KB 91.9KB +459.0B
Unknown metric groups

API count

id before after diff
observability 369 373 +4

ESLint disabled line counts

id before after diff
apm 89 92 +3

Total ESLint disabled count

id before after diff
apm 104 107 +3


@dgieselaar dgieselaar merged commit 7af6915 into elastic:main Apr 21, 2022
@dgieselaar dgieselaar deleted the use-progressive-fetcher branch April 21, 2022 10:11
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Apr 21, 2022
@kibanamachine
Contributor

⚪ Backport skipped

The pull request was not backported as there were no branches to backport to. If this is a mistake, please apply the desired version labels or run the backport tool manually.

Manual backport

To create the backport manually run:

node scripts/backport --pr 127598

Questions?

Please refer to the Backport tool documentation

dmlemeshko pushed a commit to dmlemeshko/kibana that referenced this pull request May 5, 2022
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
kertal pushed a commit to kertal/kibana that referenced this pull request May 24, 2022
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Labels
auto-backport Deprecated: Automatically backport this PR after it's merged backport:skip This commit does not require backporting release_note:enhancement Team:APM All issues that need APM UI Team support v8.3.0
Development

Successfully merging this pull request may close these issues.

[APM] Add experimental probability setting for aggregations