[ML] APM Correlations: Fix usage in load balancing/HA setups. #115145

Merged

Conversation

walterra
Contributor

@walterra walterra commented Oct 15, 2021

Summary

Part of #114046.

  • The way we customized the use of search strategies caused issues with race conditions when multiple Kibana instances were used for load balancing. This PR migrates away from search strategies and uses regular APM API endpoints.
  • The task that manages calling the sequence of queries to run the correlations analysis is now in a custom React hook (useFailedTransactionsCorrelations / useLatencyCorrelations) instead of a task on the Kibana server side. While they show up as new lines/files in the git diff, the code for the hooks is more or less a combination of the previous useSearchStrategy and the server side service files that managed queries and state.
  • The consuming React UI components needed only minimal changes. The above-mentioned hooks return the same data structure as the previously used useSearchStrategy. This also means the functional UI tests didn't need any changes and should pass as is.
  • API integration tests have been added for the individual new endpoints. The test files that were previously used for the search strategies are still there to simulate a full analysis run. The assertions for the resulting data have the same values; only the structure had to be adapted.
  • Previously all ES queries of the analysis were run sequentially. The new endpoints run ES queries in parallel where possible. Chunking is managed in the hooks on the client side.
  • For now, the endpoints use the current user's standard esClient. I tried to use the APM client, but it was missing a wrapper for the fieldCaps method, and I ran into a problem when trying to construct a random_score query. Sticking to the esClient allowed us to leave most of the functions that run the actual queries unchanged. If possible, I'd like to pick this up in a follow-up. All the endpoints do use withApmSpan() now, though.
  • The previous use of generators was also refactored away since, as mentioned above, the queries are now run in parallel.
  • Because we might run up to hundreds of similar requests for a correlation analysis, we don't want the analysis to fail if just a single query fails, as it did in the previous search-strategy-based task. I created a util splitAllSettledPromises() to handle Promise.allSettled() and split the results and errors to make the handling easier. Better naming suggestions are welcome 😅 . A future improvement could be to not run individual queries but to combine them into nested aggs or use msearch. That's out of scope for this PR though.
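To illustrate the last point, here is a minimal sketch of what a helper like splitAllSettledPromises() could look like. The return shape (`fulfilled` / `rejected`) and the `runQueries` example are assumptions made for illustration; the utility in the PR may differ.

```typescript
// Hypothetical sketch; the actual Kibana utility may differ in name and shape.
interface SplitResult<T> {
  fulfilled: T[];
  rejected: unknown[];
}

// Separate the outcome of Promise.allSettled() into values and errors.
function splitAllSettledPromises<T>(
  settled: Array<PromiseSettledResult<T>>
): SplitResult<T> {
  const fulfilled: T[] = [];
  const rejected: unknown[] = [];
  for (const result of settled) {
    if (result.status === 'fulfilled') {
      fulfilled.push(result.value);
    } else {
      rejected.push(result.reason);
    }
  }
  return { fulfilled, rejected };
}

// Usage: one failing query does not abort the whole batch.
async function runQueries(): Promise<SplitResult<number>> {
  const settled = await Promise.allSettled([
    Promise.resolve(1),
    Promise.reject(new Error('query failed')),
    Promise.resolve(3),
  ]);
  return splitAllSettledPromises(settled);
}
```

With this shape, the caller can keep the successful results and decide separately whether to log or surface the collected errors.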

Checklist

@walterra walterra self-assigned this Oct 15, 2021
@walterra walterra force-pushed the ml-apm-correlations-fix-load-balancing branch from 0255b74 to 538cf24 Compare October 15, 2021 13:47
@walterra walterra changed the title [ML] APM Correlations: Migrate search strategy to regular endpoints [ML] APM Correlations: Migrate custom search strategy to regular endpoints Oct 16, 2021
@walterra walterra changed the title [ML] APM Correlations: Migrate custom search strategy to regular endpoints [ML] APM Correlations: Migrate custom search strategies to regular endpoints Oct 16, 2021
@walterra walterra force-pushed the ml-apm-correlations-fix-load-balancing branch 4 times, most recently from 28eb8d9 to ece4fe0 Compare October 19, 2021 19:18
@walterra walterra added the bug Fixes for quality problems that affect the customer experience label Oct 20, 2021
@walterra walterra changed the title [ML] APM Correlations: Migrate custom search strategies to regular endpoints [ML] APM Correlations: Fix usage in load balancing/HA setups. Oct 20, 2021
@dgieselaar
Member

My feedback mostly revolves around parallelising some of the requests, which can hopefully improve performance. Long term, I think we should move this to one or two API calls and remove progressive loading; I don't think it's worth the complexity.

@kibanamachine
Contributor

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

| id | before | after | diff |
| --- | --- | --- | --- |
| apm | 1184 | 1189 | +5 |

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

| id | before | after | diff |
| --- | --- | --- | --- |
| apm | 2.7MB | 2.7MB | +4.0KB |

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

| id | before | after | diff |
| --- | --- | --- | --- |
| apm | 37 | 41 | +4 |


cc @walterra

Comment on lines +131 to +137
const { fieldCandidates: candidates } = await callApmApi({
  endpoint: 'GET /internal/apm/correlations/field_candidates',
  signal: abortCtrl.current.signal,
  params: {
    query: fetchParams,
  },
});
Member

This can be parallelised as well no?

Contributor Author

I parallelised the first two requests to get the chart data first so we can show the chart as soon as it's available. This one is the first request of the analysis; the requests after it depend on its output.

Member

I'm not sure if I fully understand, but there's no need (AFAICT) to delay starting the request until the data for the chart has fully loaded. E.g., you can start the request but only await it once the data for the chart has loaded.
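The "start now, await later" pattern suggested here can be sketched as follows. This is only an illustration: `fetchChartData`, `fetchFieldCandidates`, and `loadCorrelations` are hypothetical stand-ins, not the functions used in the PR.

```typescript
// Hypothetical stand-ins for the real API calls; each resolves after a short delay.
const delay = <T>(ms: number, value: T): Promise<T> =>
  new Promise((resolve) => setTimeout(() => resolve(value), ms));

const fetchChartData = () => delay(30, { chart: 'overall latency' });
const fetchFieldCandidates = () => delay(30, ['service.name', 'host.name']);

async function loadCorrelations(
  onChart: (data: { chart: string }) => void
): Promise<string[]> {
  // Start both requests immediately; nothing is awaited yet.
  const chartPromise = fetchChartData();
  const candidatesPromise = fetchFieldCandidates();

  // Render the chart as soon as its data arrives.
  onChart(await chartPromise);

  // The field candidates request has been running in the meantime,
  // so this await is usually short or already resolved.
  return await candidatesPromise;
}
```

One caveat with this pattern: if the second request rejects before it is awaited, the runtime may report an unhandled rejection, so real code would likely attach error handling when the request is started.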


const fieldCandidatesChunks = chunk(fieldCandidates, chunkSize);

for (const fieldCandidatesChunk of fieldCandidatesChunks) {
Member

This too?

Contributor Author

This splits all field candidates into chunks. The chunks are requested in sequence here on the client side, but all field candidates of a chunk are then queried in parallel on the Kibana server side.
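A rough sketch of the behaviour described in this comment, with sequential chunks on the client and parallel queries within each chunk. The `chunk` helper is written out here to keep the example self-contained (the PR snippet appears to use a library `chunk` function), and `queryChunk` and `runInChunks` are hypothetical names.

```typescript
// Split an array into chunks of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Hypothetical stand-in for one endpoint call: all field candidates
// of a chunk are queried in parallel.
async function queryChunk(fields: string[]): Promise<string[]> {
  return Promise.all(fields.map(async (field) => `${field}:done`));
}

// Client side: chunks are processed one after another.
async function runInChunks(fields: string[], chunkSize: number): Promise<string[]> {
  const results: string[] = [];
  for (const fieldsChunk of chunk(fields, chunkSize)) {
    results.push(...(await queryChunk(fieldsChunk)));
  }
  return results;
}
```

In this arrangement the chunk size bounds how many queries run at once, at the cost of waiting for the slowest query in each chunk before the next chunk starts.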

Member

Yes, but why call these in sequence? How many blocking calls can we expect here?

Contributor Author

This was made in the spirit of "make it slow". I'm sure this can be further optimized; in this PR we started to parallelize the server-side calls and played it safe on the client side. Since the field candidates and field value pairs are generated dynamically, we don't want to allow an unlimited number of queries to run in parallel. Field candidates usually number in the dozens; field value pairs can number in the hundreds.

Member

I don't see how running dozens of requests sequentially is a better experience. We can use pLimit here to limit the number of concurrent requests. Something like 5 sounds like a good start.
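As an illustration of the idea, the sketch below caps the number of concurrent requests with a small hand-written limiter. It is not the p-limit library itself, only a self-contained approximation of the same concept; `createLimiter`, `queryField`, and `queryAll` are hypothetical names.

```typescript
// Minimal limiter written out for illustration; not the p-limit implementation.
function createLimiter(concurrency: number) {
  let active = 0;
  const queue: Array<() => void> = [];

  // Called whenever a task settles: free the slot and start the next queued task.
  const next = () => {
    active--;
    const run = queue.shift();
    if (run) run();
  };

  return function limit<T>(task: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const run = () => {
        active++;
        // Promise.resolve().then(task) also catches synchronous throws from `task`.
        Promise.resolve().then(task).then(resolve, reject).finally(next);
      };
      if (active < concurrency) run();
      else queue.push(run);
    });
  };
}

// Hypothetical stand-in for a single correlation query.
async function queryField(field: string): Promise<string> {
  return `${field}:done`;
}

// Run all queries with at most `concurrency` of them in flight at once.
async function queryAll(fields: string[], concurrency = 5): Promise<string[]> {
  const limit = createLimiter(concurrency);
  return Promise.all(fields.map((field) => limit(() => queryField(field))));
}
```

Compared with strictly sequential chunks, a limiter keeps a fixed number of requests in flight, so a slow query does not hold up the ones queued behind it.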

Member

@sorenlouv sorenlouv Nov 8, 2021

+1 for using p-limit here

Contributor Author

p-limit sounds like a good idea worth pursuing, but can we agree to do this in a follow-up?

Member

yes that's fine 👍

Member

@dgieselaar dgieselaar left a comment

I'll approve this with the caveat that I haven't fully tested it. I'm mostly new to this code, we don't have much time, and this bug fix is sorely needed. Thanks @walterra!

@walterra walterra merged commit f9c982d into elastic:main Nov 9, 2021
@walterra walterra deleted the ml-apm-correlations-fix-load-balancing branch November 9, 2021 09:27
walterra added a commit to walterra/kibana that referenced this pull request Nov 9, 2021
walterra added a commit to walterra/kibana that referenced this pull request Nov 9, 2021
walterra added a commit that referenced this pull request Nov 9, 2021
…115145) (#117979)

* [ML] APM Correlations: Fix usage in load balancing/HA setups. (#115145)

* [ML] Fix http client types.
walterra added a commit that referenced this pull request Nov 9, 2021
… (#118004)

tylersmalley pushed a commit that referenced this pull request Nov 9, 2021
Conflict between #117958 and #115145

Signed-off-by: Tyler Smalley <tyler.smalley@elastic.co>
spalger pushed a commit to kibanamachine/kibana that referenced this pull request Nov 9, 2021
Conflict between elastic#117958 and elastic#115145

Signed-off-by: Tyler Smalley <tyler.smalley@elastic.co>
kibanamachine added a commit that referenced this pull request Nov 9, 2021
…117958) (#118074)

* [kbn/io-ts] export and require importing individual functions (#117958)

* [kbn/io-ts] fix direct import

Conflict between #117958 and #115145

Signed-off-by: Tyler Smalley <tyler.smalley@elastic.co>

Co-authored-by: Spencer <email@spalger.com>
Co-authored-by: spalger <spalger@users.noreply.github.com>
Co-authored-by: Tyler Smalley <tyler.smalley@elastic.co>
kpatticha pushed a commit to kpatticha/kibana that referenced this pull request Nov 10, 2021
Conflict between elastic#117958 and elastic#115145

Signed-off-by: Tyler Smalley <tyler.smalley@elastic.co>
fkanout pushed a commit to fkanout/kibana that referenced this pull request Nov 17, 2021
fkanout pushed a commit to fkanout/kibana that referenced this pull request Nov 17, 2021
Conflict between elastic#117958 and elastic#115145

Signed-off-by: Tyler Smalley <tyler.smalley@elastic.co>
roeehub pushed a commit to build-security/kibana that referenced this pull request Dec 16, 2021
roeehub pushed a commit to build-security/kibana that referenced this pull request Dec 16, 2021
Conflict between elastic#117958 and elastic#115145

Signed-off-by: Tyler Smalley <tyler.smalley@elastic.co>
Labels
apm:correlations, bug, :ml, release_note:fix, Team:APM, v7.16.0, v8.0.0, v8.1.0

7 participants