Combine unit and integration test steps into one stage #9733
Conversation
Force-pushed from 89b306a to bdf9ba1
This removes the barrier wait between the unit tests and the integration tests in CI. This will increase capacity requirements and usage but, assuming we can meet that with the autoscaler, it should reduce CI times by an hour or two since we're doing all the testing in parallel. The slow path is CPU unit tests -> GPU frontend tests, so kicking off the GPU frontend tests sooner should help decrease CI runtime.
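For readers less familiar with the Jenkinsfile layout, here is a minimal sketch of the before/after shape being described. The stage names, node labels, and script paths below are illustrative placeholders, not the actual TVM Jenkinsfile:

```groovy
// Illustrative scripted-pipeline sketch only -- names and paths are placeholders.
//
// Before: the unit test stage was a barrier, so GPU frontend tests could not
// start until every unit test branch had finished:
//
//   stage('Unit Test')        { parallel(/* unit test branches */) }
//   stage('Integration Test') { parallel(/* integration + frontend branches */) }
//
// After: all test branches live in one parallel block, so each branch starts
// as soon as an executor and its build artifacts are available.
stage('Test') {
  parallel(
    'unittest: GPU': {
      node('GPU') {
        sh './tests/scripts/run_unit_tests.sh'        // placeholder script
      }
    },
    'python3: CPU': {
      node('CPU') {
        sh './tests/scripts/run_integration_tests.sh' // placeholder script
      }
    },
    'frontend: GPU': {
      node('GPU') {
        sh './tests/scripts/run_frontend_tests.sh'    // placeholder script
      }
    }
  )
}
```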
Force-pushed from bdf9ba1 to 5c4a6fa
thanks @driazati !
}
},
'python3: CPU': {
i think right now, failing unit tests wouldn't cancel integration tests. are we concerned with overburdening CI with PRs that fail unit tests? I wonder if we should somehow cancel integration test builds if unit tests fail. we could also just try merging and see if it's a problem, too.
The potential time savings of this are pretty significant so I think it's worth it even if it demands more capacity from CI. I could try to pull some numbers of # of jobs vs # of jobs with failing unit tests vs # of jobs with failing integration tests to back this up
Here are some classifications over the last several failing PR jobs; it seems like unit test failures aren't super frequent compared to the other categories, so this PR is probably fine in terms of the demand increase:
build 262
lint 91
integration 46
infra 22
unit tests 19
unclassified 3
ok, main concern here is about the frontend tests i think. those do take over an hour to run, so that might be a big load increase if they don't get cancelled.
sorry should have included this, but this is how I did the accounting:
WHEN name like '%Build and run C++ tests' THEN 'build'
WHEN name like '%Run cmake build' THEN 'build'
WHEN name like '%Run microTVM tests' THEN 'build'
WHEN name like '%integration tests' THEN 'integration'
WHEN name like '%frontend tests' THEN 'integration'
WHEN name like '%unit tests' THEN 'unit tests'
WHEN name like '%Sphinx warnings in docs' THEN 'lint'
WHEN name like '%Run lint' THEN 'lint'
WHEN name like '%executor node info' THEN 'infra'
WHEN name like '%Check out from version control' THEN 'infra'
WHEN name like '%JUnit-formatted test results' THEN 'infra'
WHEN name like '%Docker image names' THEN 'infra'
WHEN name like '%files previously stashed' THEN 'infra'
WHEN name like '%Rust build and test' THEN 'build'
Isn't the concern the case where unit tests fail and would previously have blocked the frontend (integration) tests from running? If so, the number to look at is unit test failures, since with this PR those no longer gate the frontend tests. Frontend failures mean the build got all the way there anyway, so the capacity requirements for those would be the same.
ok, given the data i support trying this change! we may need to evaluate its impact as folks start using it day-to-day.
}
}
stage('Integration Test') {
however, I'm not sure it's a good idea to run the frontend tests with no way to cancel them when the unit tests haven't passed
let's give this a shot and we can always bring forward the failFast logic from @Mousius now that we are going this route
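For reference, a minimal sketch of what that failFast behavior could look like in a scripted parallel block; the branch names and scripts are placeholders, not the actual logic from @Mousius' change:

```groovy
// With failFast: true, Jenkins aborts the remaining parallel branches as soon
// as any branch fails, e.g. cancelling long GPU frontend tests when a quick
// CPU unit test branch has already failed.
stage('Test') {
  parallel(
    failFast: true,
    'unittest: CPU': { node('CPU') { sh './run_unit_tests.sh' } },      // placeholder script
    'frontend: GPU': { node('GPU') { sh './run_frontend_tests.sh' } }   // placeholder script
  )
}
```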
let's see if there are any additional thoughts from the community
cc @Mousius
hi @driazati / @areusch, I can't see the output in the other Jenkins as I'm not authorised, but based on the Jenkinsfile this seems like a good thing to just try 😸 CI already gets super bogged down with many builds, so I don't think there'll be a real difference to that situation by implementing this, but the happy path for when CI has capacity will be vastly improved, potentially leading to fewer builds piling up anyway. The later stages will be using more GPU instances, which are currently never used unless we get that far in the pipeline, so this should be fairly safe, and there's always the revert button if it really performs poorly. I would definitely advocate for #9129 as well to maximize free capacity, but that can be a future addition once we've seen how this works in practice.

P.S. It'd be nice to see these all as GitHub checks in parallel as well 😸
agreed, that'd be nice but I don't think Jenkins gives it out of the box; maybe something like [this plugin](https://plugins.jenkins.io/github-autostatus/), or we could manually report statuses
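As a rough illustration of the "manually report statuses" option, a pipeline helper could post one GitHub commit status per parallel branch so each shows up as its own check. The helper name, credential ID, and wiring below are assumptions for the sketch, not existing TVM CI code:

```groovy
// Hypothetical helper: post a per-branch commit status via GitHub's
// commit-status API so each parallel branch appears as its own check.
def reportGithubStatus(String context, String state) {
  // 'github-token' is a placeholder credential ID.
  withCredentials([string(credentialsId: 'github-token', variable: 'GITHUB_TOKEN')]) {
    sh """
      curl -s -X POST \\
        -H "Authorization: token \$GITHUB_TOKEN" \\
        -d '{"state": "${state}", "context": "${context}"}' \\
        https://api.github.com/repos/apache/tvm/statuses/${env.GIT_COMMIT}
    """
  }
}

// Example usage inside a branch:
//   reportGithubStatus('unittest: GPU', 'pending')
//   ... run tests ...
//   reportGithubStatus('unittest: GPU', 'success')
```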
ok let's try this. i'll approve it after ci-docker-staging succeeds
Looks like it passed, so I'm going to drop this in.
This removes the barrier wait between the unit tests and the integration tests in CI. This will increase capacity requirements and usage but, assuming we can meet that with the autoscaler, it should reduce CI times by an hour or two since we're doing all the testing in parallel. The slow path is CPU unit tests -> GPU frontend tests, so kicking off the GPU frontend tests sooner should help decrease CI runtime. The CPU build was also running a bunch of unit tests; this breaks them out into their own job so the CPU build step shouldn't block the rest of CI for as long.

Co-authored-by: driazati <driazati@users.noreply.github.com>
@areusch