
[CI] Pass failFast flag to Jenkins parallel #9129

Closed
wants to merge 1 commit

Conversation

@Mousius (Member) commented Sep 26, 2021

This should cause the entire parallel branch to fail if an individual job fails - freeing up executors for other jobs rather than holding them for hours.
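
For context, a minimal sketch of what passing `failFast` to a scripted-pipeline `parallel` step looks like; the branch names, node labels, and shell steps below are placeholders for illustration, not the actual TVM Jenkinsfile stages:

```groovy
// Hypothetical scripted-pipeline sketch (not TVM's actual Jenkinsfile).
// With failFast: true, the first branch that fails aborts the remaining
// parallel branches, releasing their executors instead of holding them
// until every branch has finished.
parallel(
  failFast: true,
  'unittest: GPU': {
    node('GPU') {
      sh 'tests/scripts/example_gpu_task.sh'  // placeholder step
    }
  },
  'unittest: CPU': {
    node('CPU') {
      sh 'tests/scripts/example_cpu_task.sh'  // placeholder step
    }
  }
)
```

Without the `failFast: true` entry, every branch runs to completion even after one has already failed, which is the executor-holding behaviour this change is meant to avoid.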

@Mousius requested a review from a team as a code owner on September 26, 2021 14:36
@areusch (Contributor) commented Sep 27, 2021

people have actually sometimes asked for the opposite here--"if i submit something to Jenkins, i want it to run through everything so i don't just get the first error." perhaps we should discuss more broadly first.

@Mousius (Member, Author) commented Sep 28, 2021

> people have actually sometimes asked for the opposite here--"if i submit something to Jenkins, i want it to run through everything so i don't just get the first error." perhaps we should discuss more broadly first.

I raised this purely as a way to free up executors so CI doesn't take as long to complete, motivated by comments on PRs I was reviewing such as #8974 (comment).

I'd suggest CI isn't a drop-in replacement for a functional development environment, and people should run as much validation as possible before they submit a PR. Happy to raise a Discuss post though 😸

@jroesch (Member) commented Sep 28, 2021

@areusch I feel like this is a great forcing function to fix people's local setup; CI should not be people's personal testing environment (even if some people use it that way). If the pain becomes more urgent, we will be more strongly motivated to fix it.

@areusch (Contributor) commented Oct 3, 2021

so originally the request was essentially around the integration tests, which we run in smaller sets (e.g. relay, topi, etc). when a test in the early set fails, results from the later ones aren't reported. this change isn't quite the same, but it's the same argument for why you may not want fail-fast; for example, if a test fails in the ci_arm container, you may not know whether it's also failing in ci_gpu or vice versa. i agree CI is not a personal testing environment, but it is sometimes the easiest way for developers to access cloud platforms they don't have e.g. arm, gpu.

@Mousius the comment you referenced is a bit more general and i'm not sure this specific issue contributes to CI taking a while to complete. you can monitor CI if you're anxious for the test results. one effort in progress is the xdist work, which should have a bigger impact without potentially making it harder to access a test platform you don't have locally. i'm not opposed to changing CI to improve developer productivity, but could you motivate this specific change a bit more? in practice this seems most likely to result in cancellation of GPU integration tests, but the number of available GPU executors has not been 0 in the past month. perhaps we should track that stat for a bit now that #9128 is in. i am wondering if maybe it already somewhat addressed this concern.

@jroesch your comment is a bit generic. i still would like to see more rationale for cancelling the GPU unit tests when an ARM one fails.

@Mousius (Member, Author) commented Oct 4, 2021

Hi @areusch,

> i agree CI is not a personal testing environment, but it is sometimes the easiest way for developers to access cloud platforms they don't have e.g. arm, gpu.

I do empathise with this, but I don't think we should design a CI solution around the edge cases; by reducing the overall running jobs we can get to these faster when they do arise.

> @Mousius the comment you referenced is a bit more general and i'm not sure this specific issue contributes to CI taking a while to complete. you can monitor CI if you're anxious for the test results.

There are two things this change fixes:

  1. Machine availability: overall, machines are freer to start a job than they previously were, as we fail out of them faster.
  2. Machine saturation: running multiple tasks on a single machine results in n slow jobs; the fewer jobs you run, the more compute you have free.

I don't rely on CI for test results, but I can definitely feel the reluctance to wait for CI to complete once you have a green tick, given your change is then likely delayed to the next day each time.

> in practice this seems most likely to result in cancellation of GPU integration tests, but the number of available GPU executors has not been 0 in the past month. perhaps we should track that stat for a bit now that #9128 is in. i am wondering if maybe it already somewhat addressed this concern.

We should be very careful about treating the number of executors available as a metric for how efficient CI is. When a Jenkins agent is under load from one set of branch builds, it will have a negative effect on anything else also running; so whilst we may never run out of executors on paper, this change would result in them being less loaded and thus more efficient at running CI jobs.

@areusch (Contributor) commented Oct 5, 2021

@Mousius

> I do empathise with this, but I don't think we should design a CI solution around the edge cases; by reducing the overall running jobs we can get to these faster when they do arise.

I kind of agree, but not sure 100% here. For example, suppose there are iterative test failures on ci-arm as well as test failures on ci-gpu, neither of which you have available locally. If you push to CI, you'll wind up using resources to rebuild on all platforms.

> There are two things this change fixes:
>
>   1. Machine availability: overall, machines are freer to start a job than they previously were, as we fail out of them faster.
>   2. Machine saturation: running multiple tasks on a single machine results in n slow jobs; the fewer jobs you run, the more compute you have free.
>
> I don't rely on CI for test results, but I can definitely feel the reluctance to wait for CI to complete once you have a green tick, given your change is then likely delayed to the next day each time.

I guess I am open to trying this, but I feel a bit like we should publicize this in the forum in case anyone else is attached to the current setup. My example came from an internal OctoML ask of me. I think I feel like this because I'm not sure we have hard metrics to consult.

> We should be very careful about treating the number of executors available as a metric for how efficient CI is. When a Jenkins agent is under load from one set of branch builds, it will have a negative effect on anything else also running; so whilst we may never run out of executors on paper, this change would result in them being less loaded and thus more efficient at running CI jobs.

Ah! I agree this is the case right now, but I am sort of scheming to change this with the xdist work. Right now it is indeed possible to run > 1 PR on a node at once. With xdist, it'll be possible to use the entire node's resources on a single PR. I haven't worked out the details yet, but would propose we do this for test-cases and then it will be possible to treat queue-depth as a measure of CI load :).
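
As a rough sketch of that xdist idea (assumptions only: the node label, test directory, and use of pytest-xdist's `-n auto` are illustrative, not the worked-out plan):

```groovy
// Illustrative sketch only: dedicate a whole node to one PR's test stage and
// shard its tests across all available cores with pytest-xdist (-n auto),
// so a single PR saturates the node rather than sharing it with other PRs.
// Assumes pytest-xdist is installed in the test environment.
node('CPU') {
  sh 'python -m pytest -n auto tests/python/unittest'  // hypothetical test path
}
```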

@Mousius (Member, Author) commented Oct 5, 2021

> I guess I am open to trying this, but I feel a bit like we should publicize this in the forum in case anyone else is attached to the current setup. My example came from an internal OctoML ask of me. I think I feel like this because I'm not sure we have hard metrics to consult.

I echo @jroesch's concerns here; unless we push the issue, people will stick with what's comfortable and use the resources for other things. I'd be interested in trying it in the spirit of attempting an improvement and seeing what issues arise. I actually have a GPU handy, yet I haven't configured my local tests to use it, which meant I was hit by this after switching context in #9190 via CI. As a guilty party, I genuinely believe that being forced to properly set this up is the correct approach.

I don't mind raising a Discuss thread if it's necessary, but I'm concerned we're going to end up accommodating everyone's personal development practices rather than optimising for the CI use case the system is meant for.

@areusch (Contributor) commented Jan 6, 2022

@driazati might be relevant to #9733

@Mousius (Member, Author) commented Jan 11, 2022

Closing this whilst we experiment with #9733

@Mousius closed this Jan 11, 2022