
[NETBEANS-5478]: DownloadBinaries task: 5 secs connect time is not enough #2826

Merged · 3 commits · Apr 1, 2021

Conversation

@lbruun (Contributor) commented Mar 22, 2021

This is an improvement to the NetBeans build system, specifically the task which downloads binary prerequisites from Maven Central or https://netbeans.osuosl.org/binaries.

As per the explanation in NETBEANS-5478, a hard-coded connect timeout of 5 seconds will not work in some build environments because of the "network first-touch penalty" (I'm looking at you, GitHub Actions (= Azure)). This PR makes the connect timeout configurable, so that it is possible to do:

ant -DdownloadBinaries.connectTimeoutMs=30000 ....

(which will set the connect timeout to 30 seconds)

Current behavior is not changed, i.e. the default is still 5000 ms, which I personally find low-ish, but which has admittedly worked well for most users over the years. It just doesn't work in environments that need a warm-up.

Note: this was very hard to track down, because DownloadBinaries.java swallows exceptions; see line 212 and line 309. This PR does not fix that, in order to keep the PR to a single topic.

The current hard-coded 5-second value simply isn't enough on some platforms.
The value can now be overridden from the Ant command line, such as:

ant -DdownloadBinaries.connectTimeoutMs=30000 ....

(which will set the connect timeout to 30 seconds)
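The property-override pattern described above can be sketched roughly as follows. This is an illustrative sketch, not the actual DownloadBinaries code: the class name, constant names, and the fallback-on-malformed-value behavior are assumptions; only the property name `downloadBinaries.connectTimeoutMs` and the 5000 ms default come from the PR.

```java
// Hypothetical sketch: read the connect timeout from a system/Ant property,
// falling back to a default when the property is unset or unparsable.
public class TimeoutSketch {
    static final String PROP_NAME = "downloadBinaries.connectTimeoutMs";
    static final int DEFAULT_TIMEOUT_MS = 5_000;  // pre-PR default; later raised to 15s

    static int connectTimeoutMs() {
        String v = System.getProperty(PROP_NAME);
        if (v == null) {
            return DEFAULT_TIMEOUT_MS;
        }
        try {
            return Integer.parseInt(v.trim());
        } catch (NumberFormatException e) {
            return DEFAULT_TIMEOUT_MS;  // ignore malformed values
        }
    }

    public static void main(String[] args) {
        System.clearProperty(PROP_NAME);
        System.out.println(connectTimeoutMs());  // prints 5000
        // Simulates: ant -DdownloadBinaries.connectTimeoutMs=30000 ...
        System.setProperty(PROP_NAME, "30000");
        System.out.println(connectTimeoutMs());  // prints 30000
    }
}
```

Ant passes `-D` definitions through as JVM system properties, which is why `System.getProperty` sees the command-line value.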
@matthiasblaesing (Contributor) commented:

I'm mildly surprised that in the current cloud world a connect time of 5 s is not enough. I'm wondering, though, whether it would make sense to modify the CI tasks to take advantage of this. Apart from that, the change looks sane to me.

@neilcsmith-net (Member) commented:

If this is the cause of the various CI errors we get with downloading, what's the downside of making the default much longer (e.g. 30 s)? Otherwise we'd have to modify tasks in at least 3 different places.

I should have stopped it swallowing that exception when I edited that loop recently. Caching and logging the last exception would make sense.
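The "caching and logging the last exception" idea above could look roughly like this. Everything here is hypothetical (the `Downloader` interface, method names, and structure are invented for illustration); it is not the actual DownloadBinaries retry loop, only a sketch of the suggested fix:

```java
// Sketch: try each source in turn, remember the last failure instead of
// swallowing it, and surface it if every attempt fails.
import java.io.IOException;
import java.util.List;

public class RetryWithLastException {

    interface Downloader {
        byte[] download() throws IOException;
    }

    static byte[] downloadFromAny(List<Downloader> sources) throws IOException {
        IOException last = null;
        for (Downloader d : sources) {
            try {
                return d.download();
            } catch (IOException e) {
                last = e;  // cache the failure rather than discarding it
            }
        }
        // All attempts failed: report the last cause instead of failing silently.
        throw (last != null) ? last : new IOException("no sources configured");
    }

    public static void main(String[] args) throws IOException {
        Downloader failing = () -> { throw new IOException("mirror down"); };
        Downloader working = () -> new byte[] {1, 2, 3};
        byte[] data = downloadFromAny(List.of(failing, working));
        System.out.println(data.length);  // prints 3
    }
}
```

The point is that a caller who sees all sources fail gets the real `IOException` (e.g. a connect timeout) rather than a generic "could not download" with no cause, which is exactly what made this issue hard to track down.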

@lbruun (Contributor, Author) commented Mar 22, 2021

> I'm mildly surprised that in the current cloud world a connect time of 5 s is not enough

Me too. However, cloud PaaS platforms (I suspect GitHub Actions uses some Azure PaaS voodoo) are known to use serverless components as part of their infrastructure; for example, an outbound firewall may be based on Azure Functions and thus carry a cold-start penalty. To make matters worse, this infrastructure may be shared between ephemeral environments, so one run may experience the problem while another doesn't; it will seem random. Just speculation, though.

For me the problem has escalated over the past 1-2 months; before that it was a very, very rare occurrence. So something may have changed on GitHub's side, I don't know. Quite a few people report something similar in the GitHub issue tracker.

It is also worth mentioning why the existing NetBeans GitHub Actions jobs do not - to my knowledge - experience a problem in this respect: they use the GitHub Actions cache component. As long as this cache has been built successfully once, no downloads will be attempted by the Ant script (the DownloadBinaries task will see that the artifact is already on disk). The cache is reused forever: it is used at least every 7 days and will therefore never fall victim to GitHub's deletion strategy. There's no problem per se in the build jobs using this cache, as long as external artifact versions are immutable. But pray we never have to rebuild it.

@lbruun (Contributor, Author) commented Mar 22, 2021

> If this is the cause of the various CI errors we get with downloading, what's the downside of making the default much longer (e.g. 30 s)? Given we'd have to modify tasks in at least 3 different places otherwise.

Nothing, really. I just didn't want to touch the default in case someone was relying on the current fail-fast behavior.

5 seconds is indeed very short. For example, a URLConnection on Windows will use 21 seconds as its default connect timeout (URLConnection delegates to the OS if the timeout is not set explicitly). On CentOS 7 it is roughly 15 seconds.
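The OS-delegation behavior described above is visible in the API: a minimal demo, assuming only standard `java.net` classes (the Maven Central URL is just an example; constructing the connection object does not touch the network). `getConnectTimeout()` returns 0 until a timeout is set explicitly, and 0 means "no Java-level timeout", at which point the OS-level TCP connect timeout applies.

```java
// Demo: a URLConnection reports a connect timeout of 0 ("not set",
// so the OS default applies) until one is set explicitly.
import java.net.URL;
import java.net.URLConnection;

public class ConnectTimeoutDemo {
    public static void main(String[] args) throws Exception {
        URLConnection conn = new URL("https://repo.maven.apache.org/").openConnection();
        System.out.println(conn.getConnectTimeout());  // prints 0 (OS default applies)
        conn.setConnectTimeout(15_000);                // explicit 15-second timeout
        System.out.println(conn.getConnectTimeout());  // prints 15000
    }
}
```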

@jlahoda (Contributor) commented Mar 22, 2021

Thanks for doing this! I ran into this problem several times but never got around to submitting a PR, so this is very welcome. I'd be +1 on making the timeout longer; 15 s would seem fine to me, but it's not a really strong opinion.

@lbruun (Contributor, Author) commented Mar 22, 2021

I've now updated the PR: the default connect timeout is now 15 seconds, up from 5. I think this is more than sufficient, so existing CI build jobs do not have to change in order to benefit from this.

@lbruun (Contributor, Author) commented Apr 1, 2021

Merging this one. There seems to be agreement.

@lbruun lbruun merged commit 3ea8571 into apache:master Apr 1, 2021
@lbruun lbruun deleted the NETBEANS-5478-DownloadBinaries-improvement branch April 1, 2021 20:09