[CI] Pass failFast flag to Jenkins parallel #9129
Conversation
People have actually sometimes asked for the opposite here: "if I submit something to Jenkins, I want it to run through everything so I don't just get the first error." Perhaps we should discuss more broadly first.
I raised this purely as a way to free up executors so CI doesn't take as long to complete, motivated by comments on PRs I was reviewing such as #8974 (comment). I'd suggest CI isn't a drop-in replacement for a functional development environment, and people should be running as much validation as possible before they submit something as a PR. Happy to raise a Discuss post though 😸
@areusch I feel like this is a great forcing function to fix people's local setup; CI should not be people's personal testing environment (even if some people use it that way). If the pain becomes more urgent, we will be more strongly motivated to fix it.
So originally the request was essentially around the integration tests, which we run in smaller sets (e.g. relay, topi, etc.). When a test in the early set fails, results from the later ones aren't reported. This change isn't quite the same, but it's the same argument as to why you may not want fail-fast; for example, if a test fails in the

@Mousius the comment you referenced is a bit more general, and I'm not sure this specific issue contributes to CI taking a while to complete. You can monitor CI if you're anxious for the test results. One effort in progress is the

@jroesch your comment is a bit generic. I still would like to see more rationale for cancelling the GPU unit tests when an ARM one fails.
Hi @areusch,
I do empathise with this, but I don't think we should design a CI solution around the edge cases; by reducing the overall number of running jobs we can get to these faster when they do arise.
There are two things this change fixes:
I don't rely on CI for test results, but I can definitely feel the pain of waiting for CI to complete to get a green tick, given your change is then likely delayed to the next day each time.
We should be very careful about treating the number of executors available as a metric for how efficient CI is. When a Jenkins agent is under load from one set of branch builds, it has a negative effect on anything else running on it; so whilst we may never run out of executors on paper, this change would leave them less loaded and thus more efficient at running CI jobs.
I kind of agree, but I'm not 100% sure here. For example, suppose there are iterative test failures on ci-arm as well as test failures on ci-gpu, neither of which you have available locally. If you push to CI, you'll wind up using resources to rebuild on all platforms each time you fix one platform's failure.
I guess I am open to trying this, but I feel a bit like we should publicize this in the forum in case anyone else is attached to the current setup. My example came from an internal ask of me at OctoML. I think I feel this way because I'm not sure we have hard metrics to consult.
Ah! I agree this is the case right now, but I am sort of scheming to change this with the
This should cause the entire parallel branch to fail if an individual job fails - freeing up executors for other jobs rather than holding them for hours.
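For reference, here is a minimal sketch of what passing the flag looks like in a scripted Jenkins Pipeline; the stage names, node labels, and test script are hypothetical (this is not TVM's actual Jenkinsfile). `parallel` recognises a special `failFast` map entry that aborts the remaining branches as soon as any one of them fails.

```groovy
// Minimal sketch: failFast in a scripted Jenkins Pipeline.
// Stage names, node labels, and the shell script are hypothetical.
parallel(
    'unittest: GPU': {
        node('GPU') {
            sh './run_unit_tests.sh'  // if this branch fails...
        }
    },
    'unittest: arm': {
        node('ARM') {
            sh './run_unit_tests.sh'  // ...this one is aborted too,
        }                             // releasing its executor immediately
    },
    failFast: true  // special map entry recognised by parallel, not a branch
)
```

Without `failFast: true`, Jenkins runs every branch to completion before the `parallel` step reports failure, which is what holds executors for hours when one branch has already failed.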
I echo @jroesch's concerns here: unless we push the issue, people will stick with what's comfortable and use the resources for other things. I'd be interested in trying it in the spirit of experimenting with an improvement and seeing what issues arise. I actually have a GPU handy, yet I haven't configured my local tests to use it, which meant I was hit by a failure after switching context in #9190 via CI; as a guilty party, I genuinely believe that being forced to set this up properly is the correct approach. I don't mind raising a Discuss thread if necessary, but I'm concerned we're going to end up accommodating everyone's personal development practices rather than optimising for the CI use case the system is meant for.
Closing this whilst we experiment with #9733.