-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
timeouts when running in github actions CI on ubuntu-latest #160
Comments
@markquezada Are you running a lot of PDF-printing tests in parallel? Also, I haven't used github actions much myself, does it change anything on the test container when you ssh into it? Also, since you mentioned Sorry, nothing comes to mind at first thought :/ |
Thanks for the reply! Sorry, just saw it. I'm literally just running the one test shown above... no other PDF tests in parallel or anything like that. I just wanted to confirm it worked in CI and that it picked up the chromium package, etc. I don't believe anything changes when you SSH in. I'm using the I double checked the logs and I did notice these which I hadn't noticed before:
I didn't notice them before because this output wasn't near the test itself, but rather intermingled in the test output. Does that help at all? The connection terminating while reading from pipe sounds... significant 😅 The sandbox one is odd since I was running with no_sandbox: true. Maybe a tangent, but I was taking a look at the chrome docs and I thought these flags looked interesting. Not sure if they're relevant:
For completeness and in case it's relevant, I did want to note I have this in my ExUnit.after_suite(fn _ -> Supervisor.stop(MyApp.Supervisor) end) I saw it somewhere in the docs related to not properly shutting down. |
Hey @markquezada
But I'm also just guessing, to be honest. It sure looks like some kind of race condition at boot time. Further debugging ideas / questions:
|
This patch adds an `Inspect` implementation for `ChromicPDF.Protocol` that redacts any potentially sensitive data, leaving merely ChromicPDF internals and options. Next, the inspected protocol is included in the timeout error in `Channel.run_protocol/3`, as this is the error a lot of people run into in various situations: boot time, teardown, etc. See issues #160, #171 for instance.
This patch adds an `Inspect` implementation for `ChromicPDF.Protocol` that redacts any potentially sensitive data, leaving merely ChromicPDF internals and options. Next, the inspected protocol is included in the timeout error in `Channel.run_protocol/3`, as this is the error a lot of people run into in various situations: boot time, teardown, etc. See issues #160, #171 for instance.
Hi @markquezada Can we close this? Any news? |
Hi @maltoe, thanks for following up. It does still seem to be a problem for us. I haven't had time to circle back on this but I see that you released an update that could help me debug so I'll see if I can get you the debugging info you requested. |
Quick update: I added a sleep call in the test and so far none of our builds have failed. I'm going to let that run for another day or so and then I'll remove it to see if I can trigger the same error I was getting previously. |
@maltoe FYI, I've been slowly reducing the sleep time when printing in test and now it's back at 0 and I'm getting this occasionally:
I'm happy to simply keep the call to |
Yeah, that one we could probably address. You're still using the A bit of context: The
So for you to fix this: Use a queue like Oban to ensure limiting your concurrent @andreasknoepfle Regarding "that one we could probably address" -> I think we should rescue this error and explain this better. I'm unsure about adding a |
This patch adds a better error message for the `exit` signal sent by `NimblePool.checkout/3` when it fails to acquire a free worker. - Introduce `ExecutionError` - Also make the channel raise this error, seemed to make sense - Test for it - Also minor changes to the timeouts in the session: - Make `@close_checkout` its own thing and set it to 1s - Explicitly set the NimblePool checkout and also set it to 1s (default is actually 5s but I figured this error always means a big setup problem for the user, so we may as well be conservative?) Adresses [[#160]](#160)
This patch adds a better error message for the `exit` signal sent by `NimblePool.checkout/3` when it fails to acquire a free worker. - Introduce `ExecutionError` - Also make the channel raise this error, seemed to make sense - Test for it - Also minor changes to the timeouts in the session: - Make `@close_checkout` its own thing and set it to 1s - Explicitly set the NimblePool checkout and also set it to 1s (default is actually 5s but I figured this error always means a big setup problem for the user, so we may as well be conservative?) Adresses [[#160]](#160)
This patch adds a better error message for the `exit` signal sent by `NimblePool.checkout/3` when it fails to acquire a free worker. - Introduce `ExecutionError` - Also make the channel raise this error, seemed to make sense - Also minor changes to the timeouts in the session: - Make `@close_checkout` its own thing and set it to 1s - Explicitly set the NimblePool checkout and set it to 1s - default is actually 5s but I figured this error always means a big setup problem for the user, so we may as well be conservative? Adresses [[#160]](#160)
@markquezada addendum on my previous comment: @andreasknoepfle pointed out that you were only running a single test on CI, which I completely forgot in the meantime :) So I guess it's not really plausible that you max out the concurrency of the pool? In that case I'd offer the following theory:
I could simulate this by reducing the checkout timeout to 1ms and running this on the console:
|
FWIW I tried setting up Github actions for this repository, and look at the first error that jumps in my face: [edit] or look at this job: https://github.com/bitcrowd/chromic_pdf/actions/runs/3600997571/jobs/6066285321 all sorts of EXITs and timeouts |
This one may be relevant microsoft/playwright#4345 |
This patch adds a better error message for the `exit` signal sent by `NimblePool.checkout/3` when it fails to acquire a free worker. - Introduce `ExecutionError` - Also make the channel raise this error, seemed to make sense - Also minor changes to the timeouts in the session: - Make `@close_checkout` its own thing and set it to 1s - Explicitly set the NimblePool checkout and set it to 1s - default is actually 5s but I figured this error always means a big setup problem for the user, so we may as well be conservative? Adresses [[#160]](#160)
This patch adds a better error message for the `exit` signal sent by `NimblePool.checkout/3` when it fails to acquire a free worker. - Introduce `ExecutionError` - Also make the channel raise this error, seemed to make sense - Also minor changes to the timeouts in the session: - Make `@close_checkout` its own thing and set it to 1s - Explicitly set the NimblePool checkout and set it to 1s - default is actually 5s but I figured this error always means a big setup problem for the user, so we may as well be conservative? Adresses [[#160]](#160)
This patch adds a better error message for the `exit` signal sent by `NimblePool.checkout/3` when it fails to acquire a free worker. - Introduce `ExecutionError` - Also make the channel raise this error, seemed to make sense - Also minor changes to the timeouts in the session: - Make `@close_checkout` its own thing and set it to 1s - Make `@checkout_timeout` explicit (but leave it at default 5s) Adresses [[#160]](#160)
This patch improves the error handling of the session pool and the channel in various ways. - Introduce ExecutionError custom exception. - Make NimblePool.checkout/3 error more meaningful: The exit signal sent by NimblePool.checkout/3 is now caught and translated into a helpful ExecutionError. - Make the SessionPool.terminate_worker/3 callback gracefully terminate the target when the client failed with an error (channel timeout or inspectorCrashed now). - Minor changes to the timeouts in the session: - Make `@close_checkout` its own thing and set it to 1s. - Make `@checkout_timeout` explicit, but leave it at default 5s. - Turn the channel timeout error into an `ExecutionError` as well. - Pull out the warning about the inspectorCrashed message from the protocol into the channel. This would previously print the warning as many times as protocols were being processed, which wasn't ideal. Moved to the channel where it's printed only once. - Additionally, in case the session that received the inspectorCrashed message is currently executing a protocol, make that error with a specific error code and raise a specific exception as well (instead of the non-descriptive timeout). Adresses [[#160]](#160)
Ah, interesting. Yes, I'm only running the one test so the nimble pool error seemed suspect. Your theory sounds pretty spot on to me. FYI, here's my current config in test: config :my_app, ChromicPDF,
on_demand: false,
offline: true,
discard_stderr: false,
no_sandbox: true,
chrome_args: "--disable-dev-shm-usage",
session_pool: [init_timeout: 30_000, timeout: 30_000] I even tried making Chromic the first thing in my supervisor tree so that it boots up first but I still get this error occasionally.
Isn't that running headful chrome though? That ticket notes that headless chromimum is optimized for cold boots. One thing I just realized... I'm not specifying what chromium binary to use and I'm using the The software list for the ubuntu runner lists:
... so headful chrome looks to be installed. Based on the way the binary is located in chrome_impl.ex, it doesn't seem like it would choose headful chrome first, but I haven't verified which binary is getting run. Just a shot in the dark. Thanks for being receptive to this issue! |
Huh? AFAIK there is no dedicated Chrome build for "headless only" operation. I thought the ticket was related as it explicitly mentioned random delays on Github actions, but I have since experimented with that by measuring the time it takes To me it feels like there is something fishy on Github Actions. Chrome may be doing some work on start up that may delay it becoming responsive to devtools commands. Perhaps it's loading something from disk, or writing its profile, or the like; and for some reason on Github Actions that occasionally takes a long time. But that's really just a gutfeel, I'd like to investigate further, especially to find out where exactly it stalls the
You're welcome. Sorry that it has not been the greatest experience for you yet. ChromicPDF usually runs pretty stable on production, but we and other people had a lot of issues in CI already, as quickly launching and stopping Chrome instances seems to be something that is difficult to get to work reliably. |
@markquezada Still on my way of getting the test suite to run on Github actions (see #179), but it's looking good, i.e. I don't have random delays anymore. Maybe give 1.4.0 a try and see if one of the command line options I copied over from Puppeteer makes a difference. Cargo-cult programming at its finest. [EDIT] 😠 Nevermind, just got an init timeout. Forget what I said. Still in the complete dark about what's wrong here. |
A bit funny... we have dependabot set up to auto-create PRs when dependencies update. I was excited when I saw it come in with a ChromIcPDF update. But then CI failed on the PR 😭 Probably not that helpful, but for reference, here's the relevant messages in the logs:
|
I've looked into Github actions further today and established a few points:
-> It looks a lot like Chrome is initializing something on first run, perhaps some cache or so, and perhaps including writing to disk. And only sometimes Github Actions has trouble providing the required resources promptly. Perhaps some part of the filesystem is provisioned lazily. All just guesses though. Solutions
Could you do me a favor (hopefully the last) and try out #193 in your pipeline and add the |
@markquezada We're having the same issue. So far your warmup solution seems to be working after a few builds. I'll let you know if it still looks good after a few days. |
@maltoe I like this warmup solution. Seems like the best compromise. 👍 I've added it and will report back after a few days too. |
@maltoe Warmup has been working for a week now. I'd say we're good! |
Merged. Heads up, I renamed it to Curiously, it has started happening on our own CI now 🎉 Maybe it's something that has changed in recent Chromes (I've updated the Chrome version in the CI maybe a month ago)... Thanks for your help everyone! @markquezada, I'm going to interpret your silence as "it hasn't happened again" and going to close this issue for now. Feel free to reopen if necessary. |
Released as v1.5.0 |
@maltoe sorry, just saw this as I've been out on vacation. Yep, it all still seems to be working well with the warmup option for us! Thanks for your help and persistence in tracking this down and providing a fix! |
Hi. Thanks for this awesome library. In general, it works really well. We have no issues in development.
I tried to get ChromicPDF working in github actions CI and am getting this error/warning:
I'm using these options to configure Chromic:
In development, rendering takes around half a second and that's with
on_demand
set totrue
. This is for a one page PDF that's mostly just text.I'm using an
ubuntu-latest
github actions runner. The runner image includes google chrome, chromium and chrome driver so I did not do anything to download another version of chromium, etc.. When checking which version Chromic was using, it looks like it selected"/usr/bin/chromium-browser"
and I don't get any errors on boot about not being able to find chromium.Of note, when I SSH into the runner and try running the test suite manually, it passes without any issues. I also tried rendering a pdf manually in IEx on the runner image and I got a blob back almost instantly. But for some reason, as part of CI, I'm getting flaky tests that take a long time to run and sometimes timeout. Sometimes the tests are marked as "passed" even when I see these errors. Other times, I see these errors and the test suite is marked as failed. I can't discern any difference in output between those two though.
Here's the test I'm running:
Here's the
render
function:I've fiddled with this for many hours and can't seem to get a configuration that is reliable on github actions CI. Any thoughts?
The text was updated successfully, but these errors were encountered: