Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FTR: test suite teardown takes significant amount of time #171743

Open
dmlemeshko opened this issue Nov 22, 2023 · 1 comment
Open

FTR: test suite teardown takes significant amount of time #171743

dmlemeshko opened this issue Nov 22, 2023 · 1 comment
Assignees
Labels
FTR Team:QA Team label for QA Team

Comments

@dmlemeshko
Copy link
Member

dmlemeshko commented Nov 22, 2023

We noticed a strange delay on FTR suite teardown:

Test runner finish the suite at 18:29:39 CEST and Kibana server is stopped at 18:33:58 CEST , which means something took 4+ minutes to finalise the teardown.

https://buildkite.com/elastic/kibana-pull-request/builds/175830#018bcebe-dd02-4045-b36a-bd9c7d78a7fe

  | 2023-11-14 18:29:39 CEST | └-> "after all" hook: afterTestSuite.trigger in "X-Pack Accessibility Tests - Group 2"
  | 2023-11-14 18:29:39 CEST |  
  | 2023-11-14 18:29:39 CEST | 107 passing (14.0m)
  | 2023-11-14 18:29:39 CEST | 1 pending
  | 2023-11-14 18:29:39 CEST | warn browser[SEVERE] ERROR FETCHING BROWSR LOGS: This driver instance does not have a valid session ID (did you call WebDriver.quit()?) and may no longer be used.
  | 2023-11-14 18:29:48 CEST | proc [kibana] [2023-11-14T17:29:48.196+00:00][INFO ][plugins.fleet] Fleet Usage: {"agents_enabled":true,"agents":{"total_enrolled":0,"healthy":0,"unhealthy":0,"offline":0,"inactive":0,"unenrolled":0,"total_all_statuses":0,"updating":0},"fleet_server":{"total_all_statuses":0,"total_enrolled":0,"healthy":0,"unhealthy":0,"offline":0,"updating":0,"num_host_urls":0}} {"service":{"node":{"roles":["background_tasks","ui"]}}}
  | 2023-11-14 18:29:49 CEST | warn browser[SEVERE] ERROR FETCHING BROWSR LOGS: This driver instance does not have a valid session ID (did you call WebDriver.quit()?) and may no longer be used.
  | 2023-11-14 18:30:00 CEST | warn browser[SEVERE] ERROR FETCHING BROWSR LOGS: This driver instance does not have a valid session ID (did you call WebDriver.quit()?) and may no longer be used.
...
  | 2023-11-14 18:33:11 CEST | warn browser[SEVERE] ERROR FETCHING BROWSR LOGS: This driver instance does not have a valid session ID (did you call WebDriver.quit()?) and may no longer be used.
  | 2023-11-14 18:33:22 CEST | warn browser[SEVERE] ERROR FETCHING BROWSR LOGS: This driver instance does not have a valid session ID (did you call WebDriver.quit()?) and may no longer be used.
  | 2023-11-14 18:33:32 CEST | warn browser[SEVERE] ERROR FETCHING BROWSR LOGS: This driver instance does not have a valid session ID (did you call WebDriver.quit()?) and may no longer be used.
  | 2023-11-14 18:33:42 CEST | warn browser[SEVERE] ERROR FETCHING BROWSR LOGS: This driver instance does not have a valid session ID (did you call WebDriver.quit()?) and may no longer be used.
  | 2023-11-14 18:33:52 CEST | warn browser[SEVERE] ERROR FETCHING BROWSR LOGS: This driver instance does not have a valid session ID (did you call WebDriver.quit()?) and may no longer be used.
  | 2023-11-14 18:33:58 CEST | proc [kibana] [2023-11-14T17:33:58.178+00:00][INFO ][root] SIGTERM received - initiating shutdown {"service":{"node":{"roles":["background_tasks","ui"]}}}
  | 2023-11-14 18:33:58 CEST | proc [kibana] [2023-11-14T17:33:58.179+00:00][INFO ][root] Kibana is shutting down {"service":{"node":{"roles":["background_tasks","ui"]}}}
  | 2023-11-14 18:33:58 CEST | proc [kibana] [2023-11-14T17:33:58.186+00:00][INFO ][plugins.monitoring.monitoring.kibana-monitoring] Monitoring stats collection is stopped {"service":{"node":{"roles":["background_tasks","ui"]}}}
  | 2023-11-14 18:33:58 CEST | proc [kibana] [2023-11-14T17:33:58.206+00:00][INFO ][plugins-system.standard] All plugins stopped. {"service":{"node":{"roles":["background_tasks","ui"]}}}
  | 2023-11-14 18:33:58 CEST | info [kibana] exited with null after 1185.4 seconds
  | 2023-11-14 18:33:58 CEST | info [es] killing node
  | 2023-11-14 18:33:58 CEST | info [es] stopping node ftr
  | 2023-11-14 18:33:58 CEST | info [es] no debug files found, assuming es did not write any
  | 2023-11-14 18:33:58 CEST | info [es] cleanup complete
  | 2023-11-14 18:34:00 CEST | FTR configs complete

Few problems to fix:

  1. unsubscribe poll for logs when WenDriver instance is destroyed, not when FTR cleanup stage is finished
  2. figure out what is delaying teardown and try to make it way faster

Back in August teardown was quite fast:

  | 2023-08-16 11:49:29 CEST | └-> "after all" hook: afterTestSuite.trigger in "Action Task Params"
  | 2023-08-16 11:49:29 CEST |  
  | 2023-08-16 11:49:29 CEST | 1 passing (3.9s)
  | 2023-08-16 11:49:29 CEST |  
  | 2023-08-16 11:49:37 CEST | warn [kibana] already stopped
  | 2023-08-16 11:49:37 CEST | info [es] killing node
  | 2023-08-16 11:49:37 CEST | info [es] stopping node ftr
  | 2023-08-16 11:49:38 CEST | info [es] no debug files found, assuming es did not write any
  | 2023-08-16 11:49:38 CEST | info [es] cleanup complete
  | 2023-08-16 11:49:38 CEST | info [kbn-ui] exited with null after 98.4 seconds
  | 2023-08-16 11:49:38 CEST | info [kbn-tasks] exited with null after 98.3 seconds
@botelastic botelastic bot added the needs-team Issues missing a team label label Nov 22, 2023
@dmlemeshko dmlemeshko changed the title FTR: FTR: test suite teardown takes significant amount of time Nov 22, 2023
@dmlemeshko dmlemeshko self-assigned this Nov 22, 2023
@dmlemeshko dmlemeshko added the Team:QA Team label for QA Team label Nov 22, 2023
@elasticmachine
Copy link
Contributor

Pinging @elastic/appex-qa (Team:QA)

@botelastic botelastic bot removed the needs-team Issues missing a team label label Nov 22, 2023
@dmlemeshko dmlemeshko added the FTR label Nov 22, 2023
dmlemeshko added a commit that referenced this issue Jan 12, 2024
…xists (#171672)

## Summary

Part of #171743

Since we no longer collecting code coverage for functional tests, I
removed flushing on stop signal.

We also noticed a failure on test completion and it is probably due to
resubscribing for logs polling when WebDriver session is already
destroyed. See #171743 for actual logs example

My guess, our current approach to keep fetching WebDriver/Browser logs
based on FTR lifecycle is not very robust:
`takeUntil(lifecycle.cleanup.after$)` keeps emitting until the `cleanup`
stage is finished, and internally we keep resubscribing on WebDriver
failures due to `NoSuchSessionError` and keep trying to fetch logs again
& again. Since we don't re-create WebDriver session, it simply doesn't
make sense. Resubscribing seems useless.

I'm replacing `takeUntil(lifecycle.cleanup.after$)` with
`takeWhile((loggingEntry: logging.Entry)
=>!loggingEntry.message.startsWith(FINAL_LOG_ENTRY_PREFIX))` so that we
stop as soon as we get NoSuchSessionError even if it is before
cleanup.after stage in FTR lifecycle (e.g. browser crashed during the
test run)

For **reviewers**: please give it a try locally and check that you are
able to see browser logs, but no errors/delay on FTR teardown.

How to test:
- start servers & test runner
- let the browser window to be opened by WebDriver and then close it
manually
- check console logs: `browser` logs should be stopped immediately, you
don't see any browser/webdriver logs other than general FTR error
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Jan 12, 2024
…xists (elastic#171672)

## Summary

Part of elastic#171743

Since we no longer collecting code coverage for functional tests, I
removed flushing on stop signal.

We also noticed a failure on test completion and it is probably due to
resubscribing for logs polling when WebDriver session is already
destroyed. See elastic#171743 for actual logs example

My guess, our current approach to keep fetching WebDriver/Browser logs
based on FTR lifecycle is not very robust:
`takeUntil(lifecycle.cleanup.after$)` keeps emitting until the `cleanup`
stage is finished, and internally we keep resubscribing on WebDriver
failures due to `NoSuchSessionError` and keep trying to fetch logs again
& again. Since we don't re-create WebDriver session, it simply doesn't
make sense. Resubscribing seems useless.

I'm replacing `takeUntil(lifecycle.cleanup.after$)` with
`takeWhile((loggingEntry: logging.Entry)
=>!loggingEntry.message.startsWith(FINAL_LOG_ENTRY_PREFIX))` so that we
stop as soon as we get NoSuchSessionError even if it is before
cleanup.after stage in FTR lifecycle (e.g. browser crashed during the
test run)

For **reviewers**: please give it a try locally and check that you are
able to see browser logs, but no errors/delay on FTR teardown.

How to test:
- start servers & test runner
- let the browser window to be opened by WebDriver and then close it
manually
- check console logs: `browser` logs should be stopped immediately, you
don't see any browser/webdriver logs other than general FTR error

(cherry picked from commit d34915b)
semd pushed a commit to semd/kibana that referenced this issue Jan 12, 2024
…xists (elastic#171672)

## Summary

Part of elastic#171743

Since we no longer collecting code coverage for functional tests, I
removed flushing on stop signal.

We also noticed a failure on test completion and it is probably due to
resubscribing for logs polling when WebDriver session is already
destroyed. See elastic#171743 for actual logs example

My guess, our current approach to keep fetching WebDriver/Browser logs
based on FTR lifecycle is not very robust:
`takeUntil(lifecycle.cleanup.after$)` keeps emitting until the `cleanup`
stage is finished, and internally we keep resubscribing on WebDriver
failures due to `NoSuchSessionError` and keep trying to fetch logs again
& again. Since we don't re-create WebDriver session, it simply doesn't
make sense. Resubscribing seems useless.

I'm replacing `takeUntil(lifecycle.cleanup.after$)` with
`takeWhile((loggingEntry: logging.Entry)
=>!loggingEntry.message.startsWith(FINAL_LOG_ENTRY_PREFIX))` so that we
stop as soon as we get NoSuchSessionError even if it is before
cleanup.after stage in FTR lifecycle (e.g. browser crashed during the
test run)

For **reviewers**: please give it a try locally and check that you are
able to see browser logs, but no errors/delay on FTR teardown.

How to test:
- start servers & test runner
- let the browser window to be opened by WebDriver and then close it
manually
- check console logs: `browser` logs should be stopped immediately, you
don't see any browser/webdriver logs other than general FTR error
kibanamachine added a commit that referenced this issue Jan 12, 2024
…onger exists (#171672) (#174770)

# Backport

This will backport the following commits from `main` to `8.12`:
- [[ftr/remote] stop WebDriver logs polling when its session no longer
exists (#171672)](#171672)

<!--- Backport version: 9.4.3 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Dzmitry
Lemechko","email":"dzmitry.lemechko@elastic.co"},"sourceCommit":{"committedDate":"2024-01-12T15:12:35Z","message":"[ftr/remote]
stop WebDriver logs polling when its session no longer exists
(#171672)\n\n## Summary\r\n\r\nPart of #171743\r\n\r\nSince we no longer
collecting code coverage for functional tests, I\r\nremoved flushing on
stop signal.\r\n\r\nWe also noticed a failure on test completion and it
is probably due to\r\nresubscribing for logs polling when WebDriver
session is already\r\ndestroyed. See #171743 for actual logs
example\r\n\r\nMy guess, our current approach to keep fetching
WebDriver/Browser logs\r\nbased on FTR lifecycle is not very
robust:\r\n`takeUntil(lifecycle.cleanup.after$)` keeps emitting until
the `cleanup`\r\nstage is finished, and internally we keep resubscribing
on WebDriver\r\nfailures due to `NoSuchSessionError` and keep trying to
fetch logs again\r\n& again. Since we don't re-create WebDriver session,
it simply doesn't\r\nmake sense. Resubscribing seems useless.\r\n\r\nI'm
replacing `takeUntil(lifecycle.cleanup.after$)`
with\r\n`takeWhile((loggingEntry:
logging.Entry)\r\n=>!loggingEntry.message.startsWith(FINAL_LOG_ENTRY_PREFIX))`
so that we\r\nstop as soon as we get NoSuchSessionError even if it is
before\r\ncleanup.after stage in FTR lifecycle (e.g. browser crashed
during the\r\ntest run)\r\n\r\nFor **reviewers**: please give it a try
locally and check that you are\r\nable to see browser logs, but no
errors/delay on FTR teardown.\r\n\r\nHow to test:\r\n- start servers &
test runner\r\n- let the browser window to be opened by WebDriver and
then close it\r\nmanually\r\n- check console logs: `browser` logs should
be stopped immediately, you\r\ndon't see any browser/webdriver logs
other than general FTR
error","sha":"d34915bec78de95af02e775924d83ef95e6ea5bb","branchLabelMapping":{"^v8.13.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","FTR","v8.12.1","v8.13.0"],"title":"[ftr/remote]
stop WebDriver logs polling when its session no longer
exists","number":171672,"url":"#171672
stop WebDriver logs polling when its session no longer exists
(#171672)\n\n## Summary\r\n\r\nPart of #171743\r\n\r\nSince we no longer
collecting code coverage for functional tests, I\r\nremoved flushing on
stop signal.\r\n\r\nWe also noticed a failure on test completion and it
is probably due to\r\nresubscribing for logs polling when WebDriver
session is already\r\ndestroyed. See #171743 for actual logs
example\r\n\r\nMy guess, our current approach to keep fetching
WebDriver/Browser logs\r\nbased on FTR lifecycle is not very
robust:\r\n`takeUntil(lifecycle.cleanup.after$)` keeps emitting until
the `cleanup`\r\nstage is finished, and internally we keep resubscribing
on WebDriver\r\nfailures due to `NoSuchSessionError` and keep trying to
fetch logs again\r\n& again. Since we don't re-create WebDriver session,
it simply doesn't\r\nmake sense. Resubscribing seems useless.\r\n\r\nI'm
replacing `takeUntil(lifecycle.cleanup.after$)`
with\r\n`takeWhile((loggingEntry:
logging.Entry)\r\n=>!loggingEntry.message.startsWith(FINAL_LOG_ENTRY_PREFIX))`
so that we\r\nstop as soon as we get NoSuchSessionError even if it is
before\r\ncleanup.after stage in FTR lifecycle (e.g. browser crashed
during the\r\ntest run)\r\n\r\nFor **reviewers**: please give it a try
locally and check that you are\r\nable to see browser logs, but no
errors/delay on FTR teardown.\r\n\r\nHow to test:\r\n- start servers &
test runner\r\n- let the browser window to be opened by WebDriver and
then close it\r\nmanually\r\n- check console logs: `browser` logs should
be stopped immediately, you\r\ndon't see any browser/webdriver logs
other than general FTR
error","sha":"d34915bec78de95af02e775924d83ef95e6ea5bb"}},"sourceBranch":"main","suggestedTargetBranches":["8.12"],"targetPullRequestStates":[{"branch":"8.12","label":"v8.12.1","branchLabelMappingKey":"^v(\\d+).(\\d+).\\d+$","isSourceBranch":false,"state":"NOT_CREATED"},{"branch":"main","label":"v8.13.0","branchLabelMappingKey":"^v8.13.0$","isSourceBranch":true,"state":"MERGED","url":"#171672
stop WebDriver logs polling when its session no longer exists
(#171672)\n\n## Summary\r\n\r\nPart of #171743\r\n\r\nSince we no longer
collecting code coverage for functional tests, I\r\nremoved flushing on
stop signal.\r\n\r\nWe also noticed a failure on test completion and it
is probably due to\r\nresubscribing for logs polling when WebDriver
session is already\r\ndestroyed. See #171743 for actual logs
example\r\n\r\nMy guess, our current approach to keep fetching
WebDriver/Browser logs\r\nbased on FTR lifecycle is not very
robust:\r\n`takeUntil(lifecycle.cleanup.after$)` keeps emitting until
the `cleanup`\r\nstage is finished, and internally we keep resubscribing
on WebDriver\r\nfailures due to `NoSuchSessionError` and keep trying to
fetch logs again\r\n& again. Since we don't re-create WebDriver session,
it simply doesn't\r\nmake sense. Resubscribing seems useless.\r\n\r\nI'm
replacing `takeUntil(lifecycle.cleanup.after$)`
with\r\n`takeWhile((loggingEntry:
logging.Entry)\r\n=>!loggingEntry.message.startsWith(FINAL_LOG_ENTRY_PREFIX))`
so that we\r\nstop as soon as we get NoSuchSessionError even if it is
before\r\ncleanup.after stage in FTR lifecycle (e.g. browser crashed
during the\r\ntest run)\r\n\r\nFor **reviewers**: please give it a try
locally and check that you are\r\nable to see browser logs, but no
errors/delay on FTR teardown.\r\n\r\nHow to test:\r\n- start servers &
test runner\r\n- let the browser window to be opened by WebDriver and
then close it\r\nmanually\r\n- check console logs: `browser` logs should
be stopped immediately, you\r\ndon't see any browser/webdriver logs
other than general FTR
error","sha":"d34915bec78de95af02e775924d83ef95e6ea5bb"}}]}] BACKPORT-->

Co-authored-by: Dzmitry Lemechko <dzmitry.lemechko@elastic.co>
CoenWarmer pushed a commit to CoenWarmer/kibana that referenced this issue Feb 15, 2024
…xists (elastic#171672)

## Summary

Part of elastic#171743

Since we no longer collecting code coverage for functional tests, I
removed flushing on stop signal.

We also noticed a failure on test completion and it is probably due to
resubscribing for logs polling when WebDriver session is already
destroyed. See elastic#171743 for actual logs example

My guess, our current approach to keep fetching WebDriver/Browser logs
based on FTR lifecycle is not very robust:
`takeUntil(lifecycle.cleanup.after$)` keeps emitting until the `cleanup`
stage is finished, and internally we keep resubscribing on WebDriver
failures due to `NoSuchSessionError` and keep trying to fetch logs again
& again. Since we don't re-create WebDriver session, it simply doesn't
make sense. Resubscribing seems useless.

I'm replacing `takeUntil(lifecycle.cleanup.after$)` with
`takeWhile((loggingEntry: logging.Entry)
=>!loggingEntry.message.startsWith(FINAL_LOG_ENTRY_PREFIX))` so that we
stop as soon as we get NoSuchSessionError even if it is before
cleanup.after stage in FTR lifecycle (e.g. browser crashed during the
test run)

For **reviewers**: please give it a try locally and check that you are
able to see browser logs, but no errors/delay on FTR teardown.

How to test:
- start servers & test runner
- let the browser window to be opened by WebDriver and then close it
manually
- check console logs: `browser` logs should be stopped immediately, you
don't see any browser/webdriver logs other than general FTR error
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
FTR Team:QA Team label for QA Team
Projects
None yet
Development

No branches or pull requests

2 participants