Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Telemetry] Add scalability tests for known bottlenecks #151110

Merged
merged 13 commits into from
Mar 10, 2023

Conversation

afharo
Copy link
Member

@afharo afharo commented Feb 14, 2023

Summary

We've identified that for large deployments, some collectors take a big hit in the performance of Kibana (the event loop delay goes crazy high).

This PR adds some scalability tests to measure any future improvements in this area:

Run node scripts/run_scalability.js --journey-path x-pack/test/scalability/apis/api.telemetry.cluster_stats.no_cache.1600_dataviews.json to test if your changes improve the current behaviour.

Adding @dmlemeshko as a reviewer to validate that I'm reading the data correctly 😅

For maintainers

@afharo afharo added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc performance Feature:Telemetry technical debt Improvement of the software architecture and operational architecture labels Feb 14, 2023
@afharo afharo self-assigned this Feb 14, 2023
Copy link
Member Author

@afharo afharo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a few API Scalability tests and the results are a bit mixed:

  1. When caching is ON, we get better performance in this branch vs. main
  2. However, when caching is OFF, main performs better: we get lower rate of timeouts (this branch makes the response slower) and Kibana releases the resources of each request earlier

I need to jump on another task... I'll get back to this when done.

Comment on lines 12 to 17
{
"action": "rampUsersPerSec",
"minUsersCount": 1,
"maxUsersCount": 10,
"duration": "120s"
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low number of users as we don't expect this API to be called with refreshedCache: true too often.

Copy link
Member Author

@afharo afharo Feb 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even with that, we are testing 690 requests in around 4 minutes (2 minutes execution + 120s of response time). That's many more requests than we expect (in the non-cached scenario).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting case: what is the load we expect in non-cache scenario? How frequent?

Since these tests are targeting capacity check, it is expected to increase requests count within reasonable time.

Copy link
Member Author

@afharo afharo Feb 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, we only request the telemetry payload when we checked that it hasn't been reported in the last 24h.

That means we should expect, maximum, 1 daily peak of requests (where all active users at that moment may request the endpoint).

On top of that, that's true for the cached request. The non-cached one is only used for admins that want to audit what we send about them, so my expectation is 2-3 requests top.

@@ -0,0 +1,47 @@
{
"journeyName": "POST /api/telemetry/v2/clusters/_stats - no cache - 1600 dataviews",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This use case struggles even for such a low number of users. If maps improves their collector, this should improve.

@afharo afharo added release_note:skip Skip the PR/issue when compiling release notes backport:skip This commit does not require backporting labels Feb 14, 2023
@dmlemeshko
Copy link
Member

dmlemeshko commented Feb 15, 2023

buildkite/kibana-apis-capacity-testing pipeline failed and I'm also seeing error while running locally:

node scripts/run_scalability.js --journey-path x-pack/test/scalability/apis/api.telemetry.cluster_stats.no_cache.1600_dataviews.json
 proc [scalability-tests]  proc [gatling: test] Simulation org.kibanaLoadTest.simulation.generic.GenericJourney completed in 168 seconds
 proc [scalability-tests]  proc [gatling: test] java.lang.reflect.InvocationTargetException
 proc [scalability-tests]  proc [gatling: test] 	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:119)
 proc [scalability-tests]  proc [gatling: test] 	at java.base/java.lang.reflect.Method.invoke(Method.java:577)
 proc [scalability-tests]  proc [gatling: test] 	at io.gatling.plugin.util.ForkMain.runMain(ForkMain.java:67)
 proc [scalability-tests]  proc [gatling: test] 	at io.gatling.plugin.util.ForkMain.main(ForkMain.java:35)
 proc [scalability-tests]  proc [gatling: test] Caused by: java.lang.RuntimeException: Login request failed: org.apache.http.conn.HttpHostConnectException: Connect to localhost:5620 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] failed: Connection refused
 proc [scalability-tests]  proc [gatling: test] 	at org.kibanaLoadTest.helpers.KbnClient.getCookie(KbnClient.scala:72)
 proc [scalability-tests]  proc [gatling: test] 	at org.kibanaLoadTest.helpers.KbnClient.getClientAndConnectionManager(KbnClient.scala:50)
 proc [scalability-tests]  proc [gatling: test] 	at org.kibanaLoadTest.helpers.KbnClient.unload(KbnClient.scala:139)

I think test is killing Kibana and Gatling fails to unload kbn-archive :(
I will work on a fix (to skip unloading if Kibana/ES not responding)

@afharo
Copy link
Member Author

afharo commented Feb 16, 2023

I'm going to split the concurrency limits and the new tests. I think @mattkime might appreciate having these tests available.

@afharo afharo force-pushed the usage_collection/apply-concurrency-limits branch from 821f56f to dda022c Compare February 16, 2023 12:52
@afharo afharo force-pushed the usage_collection/apply-concurrency-limits branch from dda022c to 35ebbf7 Compare February 16, 2023 12:53
@afharo afharo changed the title [Usage Collection] Apply concurrency limits [Telemetry] Add scalability tests for known bottlenecks Feb 16, 2023
@afharo afharo marked this pull request as ready for review February 16, 2023 14:35
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-core (Team:Core)

@afharo
Copy link
Member Author

afharo commented Feb 16, 2023

@elasticmachine merge upstream

@afharo
Copy link
Member Author

afharo commented Feb 20, 2023

#151626 might improve the metrics observed in these new tests. But we may still like to improve how we handle fetching Data Views

@@ -0,0 +1,47 @@
{
"journeyName": "POST /api/telemetry/v2/clusters/_stats - no cache - 1600 dataviews",
"scalabilitySetup": {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on local results, I think we need to adjust threshold to track.

    "responseTimeThreshold": {
      "threshold1": 10000,
      "threshold2": 20000,
      "threshold3": 30000
    },

Not sure of response time matters here, we can increase values just to understand how many requests were completed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

Copy link
Member Author

@afharo afharo Feb 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ to increasing the timeouts... I'm concerned about the Connection refused failures, though... shouldn't those be timeouts instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timeouts increased in elastic/kibana@98c1f84 (#151110)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@afharo I think this test basically kills Kibana server and that's why we see "Connection refused"

I can see multiple errors like

[2023-02-21T18:51:35.148+01:00][ERROR][plugins.dataViews.dataView.indexPatterns] ConnectionError: connect EADDRNOTAVAIL 127.0.0.1:9220 - Local (0.0.0.0:0)
    at KibanaTransport.request (/Users/dmle/github/kibana/node_modules/@elastic/transport/src/Transport.ts:585:17)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at KibanaTransport.request (create_transport.ts:57:17)
    at ClientTraced.FieldCapsApi [as fieldCaps] 

(/Users/dmle/github/kibana/node_modules/@elastic/elasticsearch/src/api/api/field_caps.ts:78:10)
    at callFieldCapsApi (es_api.ts:74:12)
    at getFieldCapabilities (field_capabilities.ts:46:23)
    at IndexPatternsFetcher.getFieldsForWildcard (index_patterns_fetcher.ts:76:31)
    at IndexPatternsApiServer.getFieldsForWildcard (index_patterns_api_client.ts:33:12)
    at DataViewsService.getFieldsAndIndicesForWildcard (data_views.ts:541:12)
    at DataViewsService.refreshFieldSpecMap (data_views.ts:624:46)
    at DataViewsService.initFromSavedObjectLoadFields (data_views.ts:756:33)
    at DataViewsService.initFromSavedObject (data_views.ts:798:34)

and

2023-02-16T16:01:15.143+01:00][ERROR][http] NoLivingConnectionsError: There are no living connections
    at KibanaTransport.request (/Users/dmle/github/kibana/node_modules/@elastic/transport/src/Transport.ts:456:17)
    at KibanaTransport.request (/Users/dmle/github/kibana/node_modules/elastic-apm-node/lib/instrumentation/modules/@elastic/elasticsearch.js:143:28)
    at KibanaTransport.request (create_transport.ts:57:29)
    at Security.hasPrivileges (/Users/dmle/github/kibana/node_modules/@elastic/elasticsearch/src/api/api/security.ts:962:33)
    at checkPrivilegesAtResources (check_privileges.ts:121:81)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at Object.atSpaces (check_privileges.ts:219:16)
    at SavedObjectsSecurityExtension.checkSavedObjectsPrivileges [as checkPrivilegesFunc] (check_saved_objects_privileges.ts:52:20)
    at SavedObjectsSecurityExtension.checkPrivileges (saved_objects_security_extension.ts:612:14)
    at SavedObjectsSecurityExtension.checkAuthorization (saved_objects_security_extension.ts:411:45)
    at SavedObjectsSecurityExtension.authorize (saved_objects_security_extension.ts:578:54)
    at SavedObjectsSecurityExtension.authorizeGet (saved_objects_security_extension.ts:822:12)
    at SavedObjectsRepository.get (repository.ts:1704:33)
    at SavedObjectsClient.get (saved_objects_client.ts:117:12)
    at UiSettingsClient.read (ui_settings_client_common.ts:158:20)
    at UiSettingsClient.getUserProvided (ui_settings_client_common.ts:53:59)
    at UiSettingsClient.getAll (base_ui_settings_client.ts:56:26)
    at Object.fieldFormatServiceFactory (plugin.ts:44:31)
    at getVisData (get_vis_data.ts:35:30)
    at vis.ts:36:23
    at Router.handle (router.ts:192:30)
    at handler (router.ts:147:13)
    at exports.Manager.execute (/Users/dmle/github/kibana/node_modules/@hapi/hapi/lib/toolkit.js:60:28)
    at Object.internals.handler (/Users/dmle/github/kibana/node_modules/@hapi/hapi/lib/handler.js:46:20)
    at exports.execute (/Users/dmle/github/kibana/node_modules/@hapi/hapi/lib/handler.js:31:20)
    at Request._lifecycle (/Users/dmle/github/kibana/node_modules/@hapi/hapi/lib/request.js:371:32)
    at Request._execute (/Users/dmle/github/kibana/node_modules/@hapi/hapi/lib/request.js:281:9)

Copy link
Member Author

@afharo afharo Feb 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We covered this on Zoom. The intention of this test is to show that this API is problematic in this scenario. It will help us track any improvements for different efforts that have branched off since we've identified this bottleneck.

Is it a problem to have reports that show failures?

@afharo
Copy link
Member Author

afharo commented Feb 27, 2023

Assigning to @dmlemeshko because he wants to run some additional tests to fine-tune the thresholds 😇

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

Unknown metric groups

ESLint disabled line counts

id before after diff
securitySolution 434 437 +3

Total ESLint disabled count

id before after diff
securitySolution 514 517 +3

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @afharo @dmlemeshko

@afharo afharo merged commit 84cc0eb into elastic:main Mar 10, 2023
@afharo afharo deleted the usage_collection/apply-concurrency-limits branch March 10, 2023 09:45
bmorelli25 pushed a commit to bmorelli25/kibana that referenced this pull request Mar 10, 2023
Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
nkhristinin pushed a commit that referenced this pull request Mar 22, 2023
Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting Feature:Telemetry performance release_note:skip Skip the PR/issue when compiling release notes Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc technical debt Improvement of the software architecture and operational architecture v8.8.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants