
Intermittent requests hanging, eventually crashing the tab/browser #20897

Closed
alyssaruth opened this issue Apr 4, 2022 · 84 comments

Labels
CI (General issues involving running in a CI provider), topic: network, type: bug, type: performance 🏃‍♀️ (Performance related)

Comments

@alyssaruth

Current behavior

This is an issue we've been seeing for some time intermittently in our CI pipeline, but we had a particularly clean / minimal example on Friday so this issue will specifically reference what we saw on that occasion.

We observed that one of our parallel runners had stopped emitting any output, suggesting that Cypress was stuck somewhere. Using VNC, we remoted into the display that Cypress uses to see what was going on, and saw the browser spinning, trying to load our tests. The browser was fully responsive, and refreshing the tab got it stuck in the same place every time:

[Screenshot: Selection_999(021)]

In the network tab, we could see that it was the XHR request to $BASE_URL/__cypress/tests?p=integration/ci/glean/tasks.spec.ts which was getting stuck. Chrome listed it as 'Pending' - I've attached some screenshots of what we could see in the network tab itself.

[Screenshots: Selection_999(022), Selection_999(023), Selection_999(024)]

We were also able to reproduce it in the Chrome console by firing off a manual fetch('$BASE_URL/__cypress/tests?p=integration/ci/glean/tasks.spec.ts') - the promise that was returned never completed. We did this about three times before firing off one for a different spec file to see what would happen. As soon as we did, the browser crashed completely and we got a JavaScript heap out of memory error in the runner logs - see crashed output.txt.
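
For anyone else checking for the same symptom, this is roughly what that console experiment amounts to - a sketch, with an arbitrary 10-second timeout added so a hang shows up explicitly (the $BASE_URL placeholder is kept as above):

// Paste into the devtools console of the stuck tab. Races the spec request
// against a timer so a hang prints 'timed out' instead of a promise that
// silently never settles.
const specUrl = '$BASE_URL/__cypress/tests?p=integration/ci/glean/tasks.spec.ts'
Promise.race([
  fetch(specUrl).then(() => 'completed'),
  new Promise((resolve) => setTimeout(() => resolve('timed out'), 10000)),
]).then((result) => console.log(result))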

We've been trying to narrow down this issue for some time now, and have held off raising a Cypress issue as we were concerned that it might be a regression in our own app. However, in this instance the problem was occurring before our site was even loaded, leading us to believe it's a Cypress issue (specifically around the way that requests are proxied).

Desired behavior

No response

Test code to reproduce

Our cypress.json looks as follows, if it's of any interest:

{
  "integrationFolder": "integration",
  "pluginsFile": "plugins/index.js",
  "screenshotsFolder": "screenshots",
  "videosFolder": "videos",
  "fixturesFolder": "fixtures",
  "supportFile": "support/index.ts",
  "chromeWebSecurity": false,
  "defaultCommandTimeout": 20000,
  "numTestsKeptInMemory": 0,
  "videoUploadOnPasses": false,
}

We are running Cypress in headed mode.
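
For clarity, the invocation is along these lines (a sketch - the exact browser and extra flags vary per job):

cypress run --headed --browser chrome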

Cypress Version

9.5.1

Other

The browser in this case was Edge 100, but we're seeing the same issues in Chrome as well.

@davidmunechika added the topic: network, type: bug, and stage: needs investigating labels Apr 4, 2022
@cypress-bot added the stage: backlog label and removed the stage: needs investigating label Apr 29, 2022
@sharmilajesupaul
Contributor

We're able to reproduce this on Chrome as well. We recently updated from v7.7.0 to v9.5.3 - we don't see the issue on v7, but we do see it on v9.5.3.

@alyssaruth
Author

alyssaruth commented May 5, 2022

We've added debug logging for the browsers (by adding --enable-logging --v=1 to our launchArgs), and are consistently seeing request errors to security-type endpoints when the problem occurs. I'm not sure at this point whether this is the root cause or just a consistent symptom. I've included sample debug logs from both Edge and Chrome as attachments.
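
For reference, flags like these go in via the before:browser:launch hook in the plugins file - a minimal sketch rather than our exact plugins/index.js:

// plugins/index.js - minimal sketch of wiring up the browser logging flags
module.exports = (on, config) => {
  on('before:browser:launch', (browser = {}, launchOptions) => {
    // write Chromium debug logs (applies to both Chrome and Edge)
    launchOptions.args.push('--enable-logging', '--v=1')
    return launchOptions
  })
}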

@sharmilajesupaul it'd be interesting to know if you're seeing similar in your Chrome logs!

When remoted in and observing a build becoming stuck, what we see in the network tab is a bunch of requests that are stuck in a 'Pending' state, all showing as 'Stalled' under timings. For our app, these are generally third party things such as google recaptcha, sentry, and so on. I do not believe it is any of these individual requests that are at fault, because it's differing ones each time (and I've wasted some time stamping out various ones of them to no avail). In the case of my original post, it was a request to get the spec file itself that was stuck! The devtools docs suggest this happens when things are stuck behind "higher priority requests" - maybe they're stuck behind these security ones that we can see going wrong in debug logs?

[Screenshot: stalled recaptcha request in the network tab]
[Screenshot: devtools explanation of the 'Stalled' timing]

In an attempt to resolve this or get more information, I've tried stubbing out the security endpoints in Edge via cy.intercept() in a before each along these lines:

cy.intercept('GET', 'https://edge.microsoft.com/**', { statusCode: 200, body: '' }).as('edgeStub')

I can see that it's working in general for other edge endpoints (from the cypress logging when an intercept occurs), but it's unsuccessful in preventing requests to https://edge.microsoft.com/extensionrevocation/v1/threatListUpdates:... (even though I've verified via Cypress.minimatch that they should be picked up). These requests also do not show up in the network tab - I'm guessing that because they're security-related they're a bit special and perhaps not possible to intercept.
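
For context, the stub and the minimatch check amount to roughly this (a sketch - the support-file wiring is simplified and the extensionrevocation URL abbreviated):

// support/index.ts (sketch) - stub every edge.microsoft.com request
beforeEach(() => {
  cy.intercept('GET', 'https://edge.microsoft.com/**', { statusCode: 200, body: '' }).as('edgeStub')
})

// Checking the glob in the devtools console returns true, so the pattern
// should cover the extensionrevocation URL:
Cypress.minimatch(
  'https://edge.microsoft.com/extensionrevocation/v1/threatListUpdates',
  'https://edge.microsoft.com/**'
)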

We also gave the --disable-client-side-phishing-detection browser flag a go, but that similarly failed to prevent the problem (and the extensionrevocation hit was still in the browser debug logs). Kinda out of ideas at this point, although it feels like we're closer to the root cause!

@alyssaruth
Author

alyssaruth commented May 10, 2022

We've disabled safe browsing via the following snippet (inspired by #9017):

const originalPrefs = launchOptions.preferences.default
launchOptions.preferences.default = {
  ...originalPrefs,
  safebrowsing: {
    enabled: false,
  },
}

This has prevented the errors around https://safebrowsing.googleapis.com/v4/threatListUpdates in Chrome, but it has not stopped the requests to https://edge.microsoft.com/extensionrevocation/v1/threatListUpdates:... in Edge (even though I have verified in the browser settings that safe browsing is correctly disabled there). Early data suggests that the issue is now more prevalent on Edge than Chrome for us (we run on both), but it's a small sample size and Chrome is still affected. For Chrome, the URL that's always in the debug log is https://clientservices.googleapis.com/chrome-variations/seed?osname=linux&channel=stable&milestone=99, which is mentioned in the open issue linked above as something that can't be prevented via browser flags or preferences 😞

We'll continue to monitor the data on our end and update if we have any breakthroughs.

@cypress-bot added the stage: investigating label and removed the stage: new issues label Jun 3, 2022
@mschile
Contributor

mschile commented Jun 5, 2022

@alyssa-glean, if possible, could you run Cypress with the debug logs turned on:

DEBUG=cypress-verbose:proxy:http,cypress:* cypress run

@cypress-bot added the stage: awaiting response label and removed the stage: investigating label Jun 5, 2022
@mjhenkes assigned AtofStryker and unassigned mschile Jun 7, 2022
@sharmilajesupaul
Contributor

sharmilajesupaul commented Jun 7, 2022

@alyssa-glean My apologies, I haven't had the time to pull the logs on this, but it stopped happening for us after updating our version of Chrome. We were running a much older Chrome (89.0.4389.82) with Cypress v9, which was causing this issue for a particular test suite. When we bumped to 101.0.4951.54 (the latest at the time, on May 4th 2022), that seems to have resolved the issue.

@alyssaruth
Author

@alyssa-glean My apologies, I haven't had the time to pull the logs on this, but it stopped happening for us after updating our version of Chrome. We were running a much older Chrome (89.0.4389.82) with Cypress v9, which was causing this issue for a particular test suite. When we bumped to 101.0.4951.54 (the latest at the time, on May 4th 2022), that seems to have resolved the issue.

No problem - glad you've found something that worked for you! We did a lot of bumping of browser versions when we first hit this, so far to no avail - but we're a few stops short of 101.x.y.z so perhaps we'll give that a go.

if possible, could you run Cypress with the debug logs turned on:

Sure, we'll try this too and get back to you - we have tinkered with the debug logs a bit but not with these exact flags. We didn't spot anything interesting in them but we're definitely not the experts 😅

@tonysimpson-sonocent

@mschile hi, I work with Alyssa. I have Cypress logs with cypress-verbose:proxy:http,cypress:* from a frozen build - I can send them to you if there is a secure way of doing that.

We know a bit more about the issue: it looks like we are triggering a bug in Chrome's JavaScript engine. Using gdb, we can see that the renderer process is just looping over the same few functions related to JavaScript exceptions and rendering stack frames. We also know that the Chrome devtools do not work properly during a hang, and if you try to start the JavaScript profiler it hangs. This all suggests to me that the JavaScript engine is stuck in some uninterruptible loop; if I can get Chrome to tell me about JIT'd code locations/unwinding, I might be able to work out what's causing it.

At the moment this is a real heisenbug, as a lot of the things we've tried to make it easier to reproduce or debug have caused it to disappear. So I think this is a JavaScript engine bug, and a very weird one. This was a very long way of saying I'm not sure the Cypress logs will be helpful.

@conversaShawn

@tonysimpson-sonocent Can you send me an email with the debugging logs to shawn.harris@cypress.io, please? Please list this GitHub issue as the subject line.

@tonysimpson-sonocent

Cypress debugging log sent.

@conversayShawn

@tonysimpson-sonocent Thank you!

@conversayShawn added the stage: investigating label and removed the stage: awaiting response label Jun 13, 2022
@AtofStryker
Contributor

Cypress debugging log sent.

Hey @tonysimpson-sonocent. I was able to go through the debug logs. From the crash logs and the logs you sent, it looks like the server is running out of memory, which is possible since the browser is headful in CI and video recording is turned on. How much memory is allocated to the Jenkins job / Cypress node instance? It looks like 2GB on my end. Have you tried increasing the memory to see if the issue resolves? Could you try to increase to maybe 4GB or 8GB and capture the debug logs then and see if the issue persists? If it does persist, it could be indicative of a memory leak someplace.

@tonysimpson-sonocent

@AtofStryker the instance running Cypress has 14GB. I don't see any evidence of processes running out of memory, but I do see in the log (and via other means) that the Chrome/Edge process runs out of JavaScript heap. I think this is due to a bug in Chrome that our tests are triggering. The browser says there's about a 3.5GB limit on the JavaScript heap:

> console.memory
MemoryInfo {totalJSHeapSize: 386000000, usedJSHeapSize: 364000000, jsHeapSizeLimit: 3760000000}

We can see in the Chrome log captured by Cypress that this limit is hit:

[4240:0x398200818000]  2408097 ms: Scavenge 3508.2 (4087.3) -> 3501.0 (4087.6) MB, 6.6 / 0.0 ms  (average mu = 0.313, current mu = 0.290) allocation failure; 
[4240:0x398200818000]  2408162 ms: Scavenge 3509.9 (4087.6) -> 3502.7 (4087.8) MB, 6.7 / 0.0 ms  (average mu = 0.313, current mu = 0.290) allocation failure; 
[4240:0x398200818000]  2408225 ms: Scavenge 3511.5 (4087.8) -> 3504.4 (4088.1) MB, 6.9 / 0.0 ms  (average mu = 0.313, current mu = 0.290) allocation failure; 
<--- JS stacktrace ---> +8m
  cypress:launcher edge stderr: [4240:4240:0613/085118.101964:ERROR:v8_per_isolate_data.cc(425)] V8 javascript OOM: (Ineffective mark-compacts near heap limit)

I'm pretty sure this memory leak is a bug in chrome/edge's javascript engine - see my previous comment.

@AtofStryker
Contributor

@tonysimpson-sonocent interesting. Is there anything that consistently reproduces the issue, or evades it completely?

Have you tried increasing max_old_space_size for the V8 process in the browser to raise the heap size? I know this is not really a solution, but I am curious whether it alleviates the symptom in the meantime. You should be able to accomplish this by leveraging --js-flags, i.e. --js-flags="--max_old_space_size=8192" (8GB, or whatever value you want to try), and passing it into Cypress here.
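
For anyone wanting to try that, it would slot into the same before:browser:launch hook sketched earlier - again just a sketch, with 8192MB as an example value:

on('before:browser:launch', (browser = {}, launchOptions) => {
  // raise the V8 old-space limit inside the browser process (value in MB)
  launchOptions.args.push('--js-flags=--max_old_space_size=8192')
  return launchOptions
})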

@AtofStryker
Contributor

possibly related to #22128 but cert errors might be a red herring?

@AtofStryker
Contributor

one run (228) hung in 4.3.0. Going to downgrade to 4.1.0.

@AtofStryker
Contributor

Had a couple of hangs in 4.1.0. Going to rerun just to make sure the job wasn't taking particularly long.

@AtofStryker
Contributor

Looks like the hangs were legit. Downgrading to 4.0.2 to see how that does. @alyssa-glean I see you are rerunning the 4.0.0 commit with the 4.1.0 docker image. Let me know if it hangs up on you. I ran it 30 times and couldn't reproduce a hang.

@alyssaruth
Author

Looks like the hangs were legit. Downgrading to 4.0.2 to see how that does. @alyssa-glean I see you are rerunning the 4.0.0 commit with the 4.1.0 docker image. Let me know if it hangs up on you. I ran it 30 times and couldn't reproduce a hang.

Yeah, I thought I'd help out a bit in confirming/bulking up the sample size - it's very exciting news if we truly cannot replicate on 4.0.0! 🎉

One thing to keep in mind is that browser version might be a factor... but am I right in thinking that all of your 4.x.y tests have used the same 4.1.0 base Cypress image and would therefore also have the same chrome version?

@AtofStryker
Contributor

@alyssa-glean I have been regularly bumping the docker image when possible. I did reproduce the hang with 4.1.0 on the 4.1.0 image and did not get a hang with 4.0.0 on the 4.1.0 image (mostly out of dumb luck - it needed Node 12). What I am thinking is that when we get to a spot where we know which version is the hanging one, we can try changing some of those variables (same images/browsers/etc).

@alyssaruth
Author

Well those 5 didn't stick, so I guess that's 35 total for 4.0.0. Although the output suggests it's still using 4.1.0 despite what's in package.json?:

┌────────────────────────────────────────────────────────────────────────────────────────────────┐
  │ Cypress:    4.1.0                                                                              │
  │ Browser:    Chrome 80                                                                          │
  │ Specs:      13 found (ci/admin/atsp.spec.ts, ci/admin/inviteAdmin.spec.ts, ci/admin/inviteAsOr │
  │             gAdmin.spec.ts, ci/admin/inviteAsSuperUser.spec.ts, ci/admin/login.spec.ts, ci/adm │
  │             in/manageUser.spec.ts, ci/admin/misc.spec.ts, ci/admin/navbar.spec.ts, ci/admin/or │
  │             ganisations...)                                                                    │
  │ Searched:   integration/ci/**/*.spec.ts                                                        │
  └────────────────────────────────────────────────────────────────────────────────────────────────┘

@AtofStryker
Contributor

Ah, I am looking at this docker image and we do install 4.1.0 globally. Seems weird that I was able to get the one with 4.1.0 installed from the lockfile to hang, but 4.0.0 didn't. Either way, we are going to need to dig a bit more into why that is. 35 runs without a hang is good - I'm just wondering why that is now 🤔

@AtofStryker
Contributor

Did see a hang with 4.0.2 on the 4.1.0 docker image. I almost want to make a docker image with settings similar to the 4.1.0 one but with Cypress 4.0.0 installed, and run it multiple times to see if I can produce a hang.

@jippeholwerda

We have also had the hanging behaviour for quite a while now. I have the feeling it was introduced with a new version of Chrome and that it might not be related to Cypress. For instance, Chrome version 90 works properly, but version 99 hangs consistently.

@AtofStryker
Contributor

Ran with the custom docker image (4.0.0). One job timed out, but was producing steady output. So I increased the timeout to 45 minutes and am monitoring.

@AtofStryker
Contributor

A bit of a frustrating one today. I created a docker image that installs Node 12 and 4.0.0 globally. I bumped the job timeout to 45 minutes and STILL had 1-2 runs hang on me. We did add a few additional flags between 4.0.0 and 4.1.0 in the docker image, but I am starting to wonder if even that might be a red herring and we were just lucky with runs that did not hang.

I am wondering if we next try

  • running without the specified env variables in the docker image
  • downgrading to 3.x

@AtofStryker
Contributor

I also noticed that some tests take an incredibly long time to finish, such as this one (a little over 12 minutes)
[Screenshot: Screen Shot 2022-08-15 at 3 51 32 PM]

@alyssaruth
Author

I am starting to wonder if that even might be a red herring and we were just lucky with runs that did not hang.

Yeah, this has been the nature of the problem for us throughout TBH. Since I started looking at it I've been convinced I've found a plausible fix 3 or 4 times only to ultimately be proven wrong. In one instance we had no stuck builds for a couple of days before we found out my latest change hadn't resolved it 😞 . Intermittent problems are such a pain to investigate!

Downgrading to 3.x sounds a reasonable next step to me.

I also noticed that some tests take an incredibly long time to finish, such as this one (a little over 12 minutes)

I think these will be runs that got stuck but then managed to unstick themselves again - we've seen this as well while investigating, albeit rarely. Usually just at the point where you're ready to throw some diagnostics at it 😈 .

@AtofStryker
Contributor

Definitely can empathize with the frustration 😅 . I just hope we get close to something soon. I did take a look at downgrading to 3.8.3. The docker image in the cypress-docker-images had Chrome 77, which seems to not be able to interpret some of the JS from the site bundle. So I built a custom 3.8.3 image that uses similar configurations to the 4.1.0 included image (chrome 80), just with cypress 3.8.3 installed.

Ran 10 times so far and haven't seen a hang yet. Maybe a good sign?

@AtofStryker
Contributor

spoke too soon. 364 hung up 😢

@AtofStryker
Contributor

@alyssa-glean I am getting ready to move off of rotation this week, but @rachelruderman is going to be taking over for me on this issue for the next few weeks. Would you be able to add her to the repo to contribute?

@cypress-bot added the stage: routed to e2e-auth label and removed the stage: investigating label Aug 19, 2022
@alyssaruth
Author

Hey, just wanted to share an update on what we've been up to the last week or so.

I've just merged a PR that simplifies the test repo quite a bit more. There's now just a single spec, which runs the same test in a loop 300 times. The test is a "failed login flow" - all it does is enter some credentials, hit 'log in' and expect an error to be shown.
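
(For anyone following along, the spec is roughly this shape - a simplified sketch with placeholder selectors, credentials, and error text, not the literal spec:)

// integration/ci/login-loop.spec.ts (sketch - selectors and strings are placeholders)
describe('failed login flow', () => {
  Cypress._.times(300, (i) => {
    it(`shows an error for bad credentials (attempt ${i + 1})`, () => {
      cy.visit('/login')
      cy.get('input[name=email]').type('user@example.com')
      cy.get('input[name=password]').type('not-the-real-password')
      cy.contains('button', 'Log in').click()
      cy.contains('Incorrect email or password').should('be.visible')
    })
  })
})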

One upside of this is that there's no longer any need for any backend services, and so we've also added the local config file into source control (because there are no credentials needing to be wired anymore). This should mean that any collaborator can get up and running locally straight away 🎉

I've started experimenting with hosting our site in different ways to see if the problem remains or goes away. Currently our app is hosted in a k8s cluster behind traefik - I'm playing with a branch where we host it as a static S3 bucket instead, over raw HTTP. So far, I haven't reproduced the hang on this branch (but I would like to do more runs to be sure).

@rachelruderman
Contributor

@alyssa-glean I'm so glad to hear that! Please keep us updated 🙏

@alyssaruth
Author

It's been a while since I last gave an update, but I have some good news. We ran another experiment, this time to see what happened if we ran the same minimal scenario using Playwright instead of Cypress. We rewrote a simple login test, ran a bunch of builds with it in a loop, and we're now very confident that we've run into the exact same problem.

It manifests slightly differently, because Playwright detects that their worker has stopped responding and spins up a new one in its place (we see Worker teardown timeout of 30000ms exceeded while tearing down "context". in the logs). The resulting failure videos are either unplayable files or long videos that show our app hanging - we see these exact same symptoms in our regular Cypress builds that get stuck too.

This means the problem must lie elsewhere, and it's some interaction between our app and the browser that's the root of the problem. The bad news (for us) is that we still don't know exactly what, but we're definitely getting closer. And it means I can close this issue and not waste any more of your time. Thanks for all the help you've given us with this, we've really appreciated it! 🏆

@AtofStryker
Contributor

@alyssa-glean that's great news that there isn't something Cypress related, but I am bummed the issue is still present. We are always glad to help and if you run into any additional trouble please reach out!

@cameroncooks-branch

@alyssa-glean can you let us know if you find out the root cause between your app and the browser? We have been experiencing the same issues as you

@alyssaruth
Author

@alyssa-glean can you let us know if you find out the root cause between your app and the browser? We have been experiencing the same issues as you

Sure thing. To be honest, this has been plaguing us for so long now that I think I'll be shouting it from the rooftops whatever it turns out to be! 🙈
