[Heartbeat] Add browser monitor timeout #32434

Merged
merged 14 commits into from
Jul 27, 2022

Conversation

emilioalvap
Collaborator

@emilioalvap emilioalvap commented Jul 21, 2022

What does this PR do?

Fixes #32388.

Added logic on the Heartbeat side to kill a browser monitor's node process if it exceeds a configurable timeout without completing. If the timeout is triggered, the run is marked as failed and an error message is appended:

image

Why is it important?

This mitigates scenarios we've seen where node would go unresponsive and continue running, allocating more resources until OOM.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

  • Use the following monitor example:
- type: browser
  enabled: true
  id: Timeout
  name: Timeout
  timeout: 15
  source:
    inline:
      script:
        step("load homepage", async () => {
            await page.goto('https://www.elastic.co');
        });
        step("hover over products menu", async () => {
            await page.hover('css=[data-nav-item=products]');
        });
        step("failme", async () => {
           await (new Promise(done => {
             setTimeout(done, 100000);
           }));
           await page.hover('css=[data-nav-item=notathingonpage]');
        });
  schedule: "@every 1m"

Related issues

@emilioalvap emilioalvap added bug Team:obs-ds-hosted-services Label for the Observability Hosted Services team release-note:fix The content should be included as a fix v8.4.0 labels Jul 21, 2022
@emilioalvap emilioalvap requested a review from a team as a code owner July 21, 2022 11:05
@elasticmachine
Collaborator

Pinging @elastic/uptime (Team:Uptime)

@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Jul 21, 2022
@emilioalvap emilioalvap changed the title Browser monitor timeout [Heartbeat] Add browser monitor timeout Jul 21, 2022
@elasticmachine
Collaborator

elasticmachine commented Jul 21, 2022

💚 Build Succeeded

Build stats

  • Start Time: 2022-07-26T13:25:25.299+0000

  • Duration: 58 min 19 sec

Test stats 🧪

Test Results
Failed 0
Passed 1635
Skipped 22
Total 1657

💚 Flaky test report

Tests succeeded.

@@ -16,6 +17,7 @@ func DefaultConfig() *Config {
 	return &Config{
 		Sandbox:     false,
 		Screenshots: "on",
+		Timeout:     14 * time.Minute,
Collaborator Author

@andrewvc should I leave the default option as 15 min already? Or do we want to change that once it's actually able to run to 15 min?

Member

14 min sounds weird to me. IMO we can keep it as a whole number, 15m, and increase the service timeout. WDYT?

Collaborator Author

@vigneshshanmugam we discussed the same idea in tech sync. I noticed Andrew has already created the issue to update k8s timeout, so I'll just update this one to 15 min.

Member

@vigneshshanmugam vigneshshanmugam left a comment

Maybe it's just me, but IMO having the timeout on the monitor config does not feel ideal.

  • The timeouts on lightweight monitors cover the connection and the whole check itself, which roughly aligns with what we have for browser monitors. But browser monitors do more than open a connection, so this feels like a process timeout rather than a monitor timeout, just one level up.

  • This feature would become super confusing once we allow users to set timeouts on the Synthetics tests themselves - Configurable timeouts (UJ and Step Level) synthetics#133. As a user, I would expect to set the timeout for individual journeys, which governs how long a test should run as a whole, including the step timeouts, rather than how long HB took to run something.

I am not sure if I am being pedantic, but I feel like this should go inside the task timeout and not the monitor-level timeout.

heartbeat.jobs:
    timeout: 15m

@andrewvc
Contributor

@vigneshshanmugam I think you make some good points, but it's more of a "yes and" situation IMHO.

I see your point about a process timeout vs. a monitor timeout, but the main purpose of this feature is as a safeguard against the node process hanging and consuming resources forever. A secondary goal is providing better error messages etc.

We should have timeouts in node too, but node can hang as well. Ideally, we should have timeouts at each subprocess boundary.

A global timeout setting doesn't make sense in the context of everything we're doing with fleet and agent where we really set parameters per job. We're trying to move toward fleet and the service, where a global heartbeat config really doesn't exist. The closest thing we have is defaults applied to individual jobs.

@@ -127,14 +127,14 @@ func (p *Project) jobs() []jobs.Job {
 	isScript := p.projectCfg.Source.Inline != nil
 	if isScript {
 		src := p.projectCfg.Source.Inline.Script
-		j = synthexec.InlineJourneyJob(context.TODO(), src, p.Params(), p.StdFields(), p.extraArgs()...)
+		j = synthexec.InlineJourneyJob(context.TODO(), src, p.Params(), p.StdFields(), p.projectCfg.Timeout, p.extraArgs()...)
Contributor

I think it'd be cleaner and more idiomatic to use context.WithTimeout than pass the extra parameter. We've passed this context.TODO() for ages in anticipation of having a real timeout, so now may be the time.

Collaborator Author

Good suggestion, changing that now

go func() {
<-ctx.Done()
toTimer := time.NewTimer(timeout)
Contributor

If we just pass the context through, we can just wait on <-ctx.Done()

@@ -149,11 +149,25 @@ func TestRunBadExitCodeCmd(t *testing.T) {
})
}

-func runAndCollect(t *testing.T, cmd *exec.Cmd, stdinStr string) []*SynthEvent {
+func TestRunTimeoutExitCodeCmd(t *testing.T) {
+	cmd := exec.Command("go", "run", "./main.go")
Contributor

I like this test. Are we sure that there isn't a race here? I wonder if, on fast systems, it might execute quickly enough to be flaky. It might be safer to add a sleep to that go program.

Collaborator Author

Added a small timeout to the executable just in case

Contributor

@andrewvc andrewvc left a comment

Looking really good, we just need to clean up the spurious "could not kill synthetics process: os: process already finished" warnings this now throws.

func NewSynthexecCtx(timeout time.Duration) (context.Context, context.CancelFunc) {
cmdTimeout := timeout + 30*time.Second

synthexecCtx := context.WithValue(context.TODO(), SynthexecTimeout, cmdTimeout)
Contributor

Suggested change
-synthexecCtx := context.WithValue(context.TODO(), SynthexecTimeout, cmdTimeout)
+synthexecCtx := context.WithValue(context.Background(), SynthexecTimeout, cmdTimeout)

This is functionally the same as TODO, but signals that this is not something we think needs to be replaced eventually.

PS, nice use of WithValue

Collaborator Author

Right, I'll change that so it's clearer

go func() {
<-ctx.Done()

Contributor

As we discussed, the kill/warning is a problem since it will happen even during runs that are not broken. I think the simplest way to fix that would be to define an atomic bool that flips when cmd.Wait returns and check it here.

Collaborator Author

Rather than use a switch, which introduces a (small) state dependency between the two routines, I'd rather check the exit status with if !cmd.ProcessState.Exited() { // kill and log error}. wdyt?

Contributor

I didn't realize that existed, that's perfect

@mergify
Contributor

mergify bot commented Jul 22, 2022

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b browser-monitor-timeout upstream/browser-monitor-timeout
git merge upstream/main
git push upstream browser-monitor-timeout

Contributor

@andrewvc andrewvc left a comment

LGTM

@emilioalvap emilioalvap merged commit b1302b6 into elastic:main Jul 27, 2022
chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
* Add browser monitor timeout

* Add synthexec unit test for timeout

* Add changelog
Development

Successfully merging this pull request may close these issues.

[Heartbeat] Add timeout to browser monitors in HB side
4 participants