
[CI] Pass failFast flag to Jenkins parallel #9129

Closed
wants to merge 1 commit

Conversation

@Mousius (Member) commented Sep 26, 2021

This should cause the entire parallel branch to fail if an individual job fails - freeing up executors for other jobs rather than holding them for hours.
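
For context, a minimal sketch of what passing `failFast` to a scripted-pipeline `parallel` step looks like; the branch names, node labels, and shell steps below are placeholders for illustration, not the actual TVM Jenkinsfile stages:

```groovy
// Hypothetical scripted-pipeline sketch (not TVM's actual Jenkinsfile).
// With failFast: true, the first branch that fails aborts the remaining
// parallel branches, releasing their executors instead of holding them
// until every branch has finished.
parallel(
  failFast: true,
  'unittest: GPU': {
    node('GPU') {
      sh 'tests/scripts/example_gpu_task.sh'  // placeholder step
    }
  },
  'unittest: CPU': {
    node('CPU') {
      sh 'tests/scripts/example_cpu_task.sh'  // placeholder step
    }
  }
)
```

Without the `failFast: true` entry, every branch runs to completion even after one has already failed, which is the executor-holding behaviour this change is meant to avoid.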

@Mousius requested a review from a team as a code owner on September 26, 2021 14:36
@areusch (Contributor) commented Sep 27, 2021

people have actually sometimes asked for the opposite here--"if i submit something to Jenkins, i want it to run through everything so i don't just get the first error." perhaps we should discuss more broadly first.

@Mousius (Member, Author) commented Sep 28, 2021

> people have actually sometimes asked for the opposite here--"if i submit something to Jenkins, i want it to run through everything so i don't just get the first error." perhaps we should discuss more broadly first.

I raised this purely as a way to free up executors so CI doesn't take as long to complete, motivated by comments on PRs I was reviewing such as #8974 (comment).

I'd suggest CI isn't a drop-in replacement for a functional development environment, and people should run as much validation as possible before they submit a PR. Happy to raise a Discuss post though 😸

@jroesch (Member) commented Sep 28, 2021

@areusch I feel like this is a great forcing function to fix people's local setup; CI should not be people's personal testing environment (even if some people use it that way). If the pain becomes more urgent, we will be more strongly motivated to fix it.

@areusch (Contributor) commented Oct 3, 2021

so originally the request was essentially around the integration tests, which we run in smaller sets (e.g. relay, topi, etc). when a test in the early set fails, results from the later ones aren't reported. this change isn't quite the same, but it's the same argument for why you may not want fail-fast; for example, if a test fails in the ci_arm container, you may not know whether it's also failing in ci_gpu or vice versa. i agree CI is not a personal testing environment, but it is sometimes the easiest way for developers to access cloud platforms they don't have e.g. arm, gpu.

@Mousius the comment you referenced is a bit more general and i'm not sure this specific issue contributes to CI taking a while to complete. you can monitor CI if you're anxious for the test results. one effort in progress is the xdist work, which should have a bigger impact without potentially making it harder to access a test platform you don't have locally. i'm not opposed to changing CI to improve developer productivity, but could you motivate this specific change a bit more? in practice this seems most likely to result in cancellation of GPU integration tests, but the number of available GPU executors has not been 0 in the past month. perhaps we should track that stat for a bit now that #9128 is in. i am wondering if maybe it already somewhat addressed this concern.

@jroesch your comment is a bit generic. i still would like to see more rationale for cancelling the GPU unit tests when an ARM one fails.

@Mousius (Member, Author) commented Oct 4, 2021

Hi @areusch,

> i agree CI is not a personal testing environment, but it is sometimes the easiest way for developers to access cloud platforms they don't have e.g. arm, gpu.

I do empathise with this, but I don't think we should design a CI solution around the edge cases; by reducing the overall running jobs we can get to these faster when they do arise.

> @Mousius the comment you referenced is a bit more general and i'm not sure this specific issue contributes to CI taking a while to complete. you can monitor CI if you're anxious for the test results.

There are two things this change fixes:

  1. Machine availability: overall, machines are freer to start a job than they previously were, as we fail out of them faster.
  2. Machine saturation: running multiple tasks on a single machine results in n slow jobs; the fewer jobs you run, the more compute you have free.

I don't rely on CI for test results, but I can definitely feel the reluctance to wait for CI to complete once you have a green tick, given your change is then likely delayed to the next day each time.

> in practice this seems most likely to result in cancellation of GPU integration tests, but the number of available GPU executors has not been 0 in the past month. perhaps we should track that stat for a bit now that #9128 is in. i am wondering if maybe it already somewhat addressed this concern.

We should be very careful about treating the number of executors available as a metric for how efficient CI is. When a Jenkins agent is under load from one set of branch builds, it will have a negative effect on anything else also running; so whilst we may never run out of executors on paper, this change would result in them being less loaded and thus more efficient at running CI jobs.

@areusch (Contributor) commented Oct 5, 2021

@Mousius

> I do empathise with this, but I don't think we should design a CI solution around the edge cases; by reducing the overall running jobs we can get to these faster when they do arise.

I kind of agree, but not sure 100% here. For example, suppose there are iterative test failures on ci-arm as well as test failures on ci-gpu, neither of which you have available locally. If you push to CI, you'll wind up using resources to rebuild on all platforms.

> There are two things this change fixes:
>
>   1. Machine availability: overall, machines are freer to start a job than they previously were, as we fail out of them faster.
>   2. Machine saturation: running multiple tasks on a single machine results in n slow jobs; the fewer jobs you run, the more compute you have free.
>
> I don't rely on CI for test results, but I can definitely feel the reluctance to wait for CI to complete once you have a green tick, given your change is then likely delayed to the next day each time.

I guess I am open to trying this, but I feel a bit like we should publicize this in the forum in case anyone else is attached to the current setup. My example came from an internal OctoML ask of me. I think I feel like this because I'm not sure we have hard metrics to consult.

> We should be very careful about treating the number of executors available as a metric for how efficient CI is. When a Jenkins agent is under load from one set of branch builds, it will have a negative effect on anything else also running; so whilst we may never run out of executors on paper, this change would result in them being less loaded and thus more efficient at running CI jobs.

Ah! I agree this is the case right now, but I am sort of scheming to change this with the xdist work. Right now it is indeed possible to run > 1 PR on a node at once. With xdist, it'll be possible to use the entire node's resources on a single PR. I haven't worked out the details yet, but would propose we do this for test-cases and then it will be possible to treat queue-depth as a measure of CI load :).
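
As a rough sketch of that xdist idea (assumptions only: the node label, test directory, and use of pytest-xdist's `-n auto` are illustrative, not the worked-out plan):

```groovy
// Illustrative sketch only: dedicate a whole node to one PR's test stage and
// shard its tests across all available cores with pytest-xdist (-n auto),
// so a single PR saturates the node rather than sharing it with other PRs.
// Assumes pytest-xdist is installed in the test environment.
node('CPU') {
  sh 'python -m pytest -n auto tests/python/unittest'  // hypothetical test path
}
```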

@Mousius (Member, Author) commented Oct 5, 2021

> I guess I am open to trying this, but I feel a bit like we should publicize this in the forum in case anyone else is attached to the current setup. My example came from an internal OctoML ask of me. I think I feel like this because I'm not sure we have hard metrics to consult.

I echo @jroesch's concerns here; unless we push the issue, people will stick with what's comfortable and use the resources for other things. I'd be interested in trying it in the spirit of attempting an improvement and seeing what issues arise. I actually have a GPU handy, yet I haven't configured my local tests to use it, which meant I was hit by this after switching context in #9190 via CI. As a guilty party, I genuinely believe that being forced to properly set this up is the correct approach.

I don't mind raising a Discuss thread if it's necessary, but I'm concerned we're going to end up accommodating everyone's personal development practices rather than optimising for the CI use case the system is meant for.

@areusch (Contributor) commented Jan 6, 2022

@driazati might be relevant to #9733

@Mousius (Member, Author) commented Jan 11, 2022

Closing this whilst we experiment with #9733

@Mousius closed this Jan 11, 2022