Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combine unit and integration test steps into one stage #9733

Merged
merged 2 commits into from
Jan 11, 2022

Conversation

driazati
Copy link
Member

@driazati driazati commented Dec 13, 2021

This removes the barrier wait between test and integration tests in CI. This will increase capacity requirements and usage but, assuming we can meet that with autoscaler, should reduce CI times by an hour or two since we're doing all the testing in parallel.

The slow path is CPU unit test -> GPU frontend tests, so kicking off the GPU frontend tests faster should help decrease CI runtime. The CPU build was also running a bunch of unit tests, this breaks them out into their own job so the CPU build step shouldn't block the rest of CI as long. Test run

@areusch

@driazati driazati force-pushed the driazati/parallel_test branch 2 times, most recently from 89b306a to bdf9ba1 Compare December 14, 2021 18:30
This removes the barrier wait between test and integration tests in CI. This will increase capacity requirements and usage but, assuming we can meet that with autoscaler, should reduce CI times by an hour or two since we're doing all the testing in parallel.

The slow path is CPU unit test -> GPU frontend tests, so kicking off the GPU frontend tests faster should help decrease CI runtime.
@driazati driazati marked this pull request as ready for review December 16, 2021 00:15
@driazati driazati requested a review from a team as a code owner December 16, 2021 00:15
Copy link
Contributor

@areusch areusch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @driazati !

}
},
'python3: CPU': {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think right now, failing unit tests wouldn't cancel integration tests. are we concerned with overburdening CI with PRs that fail unit tests? I wonder if we should somehow cancel integration test builds if unit tests fail. we could also just try merging and see if it's a problem, too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The potential time savings of this are pretty significant so I think it's worth it even if it demands more capacity from CI. I could try to pull some numbers of # of jobs vs # of jobs with failing unit tests vs # of jobs with failing integration tests to back this up

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are some classifications over the last several failing PR jobs, seems like the unit test failures aren't super frequent compared to others so this PR is probably fine in terms of demand increase

build	262
lint	91
integration	46
infra	22
unit tests	19
unclassified	3

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, main concern here is about the frontend tests i think. those do take over an hour to run, so that might be a big load increase if they don't get cancelled.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry should have included this, but this is how I did the accounting:

WHEN name like '%Build and run C++ tests' THEN 'build'
WHEN name like '%Run cmake build' THEN 'build'
WHEN name like '%Run microTVM tests' THEN 'build'
WHEN name like '%integration tests' THEN 'integration'
WHEN name like '%frontend tests' THEN 'integration'
WHEN name like '%unit tests' THEN 'unit tests'
WHEN name like '%Sphinx warnings in docs' THEN 'lint'
WHEN name like '%Run lint' THEN 'lint'
WHEN name like '%executor node info' THEN 'infra'
WHEN name like '%Check out from version control' THEN 'infra'
WHEN name like '%JUnit-formatted test results' THEN 'infra'
WHEN name like '%Docker image names' THEN 'infra'
WHEN name like '%files previously stashed' THEN 'infra'
WHEN name like '%Rust build and test' THEN 'build'

Isn't the concern the case where unit tests that would have failed and then caused frontend (integration) tests to not run, so the number to look at is unit test failures (since with this PR those would no longer be gating frontend failures)? Frontend failures would mean that the build got all the way there anyways so the capacity requirements of those would be the same

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, given the data i support trying this change! we may need to evaluate its impact as folks start using it day-to-day.

}
}

stage('Integration Test') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

however I'm not sure it's a good idea to run the frontend tests as non-cancellable without unit tests passing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's give this a shot and we can always bring forward the failFast logic from @Mousius now that we are going this route

@driazati driazati requested a review from areusch January 4, 2022 18:03
@areusch
Copy link
Contributor

areusch commented Jan 6, 2022

@areusch
Copy link
Contributor

areusch commented Jan 6, 2022

let's see if there are any additional thoughts from the community

@areusch
Copy link
Contributor

areusch commented Jan 6, 2022

cc @Mousius

@Mousius
Copy link
Member

Mousius commented Jan 6, 2022

hi @driazati / @areusch, I can't see the output in the other Jenkins as I'm not authorised but based on the Jenkinsfile this seems like a good thing to just try 😸

CI already gets super bogged down with many builds so I don't think there'll be a real difference to that situation by implementing this, but the happy path for when CI has capacity will be vastly improved leading to potentially less builds building up anyway. The later stages will be using more GPU instances which are currently never used unless we get that far in the pipeline, so this should be fairly safe and there's always the revert button if it really performs poorly.

I would definitely advocate for #9129 as well to maximize free capacity but that can be a future addition once we've seen how this works in practice.

P.S. It'd be nice to see these all as GitHub checks in parallel as well 😸

@driazati
Copy link
Member Author

driazati commented Jan 6, 2022

P.S. It'd be nice to see these all as GitHub checks in parallel as well 😸

agreed that'd be nice but I don't think Jenkins gives it out of the box, maybe something like [this plugin](https://plugins.jenkins.io/github-autostatus/, this or we could manually report statuses by curl-ing at github

Copy link
Contributor

@areusch areusch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok let's try this. i'll approve it after ci-docker-staging succeeds

}
},
'python3: CPU': {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, given the data i support trying this change! we may need to evaluate its impact as folks start using it day-to-day.

}
}

stage('Integration Test') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's give this a shot and we can always bring forward the failFast logic from @Mousius now that we are going this route

Copy link
Member

@Mousius Mousius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok let's try this. i'll approve it after ci-docker-staging succeeds

Looks like it passed, so I'm going to drop this in.

@Mousius Mousius merged commit 31c22c5 into apache:main Jan 11, 2022
Raghav-Chakravarthy pushed a commit to Raghav-Chakravarthy/tvm that referenced this pull request Jan 28, 2022
This removes the barrier wait between test and integration tests in CI. This will increase capacity requirements and usage but, assuming we can meet that with autoscaler, should reduce CI times by an hour or two since we're doing all the testing in parallel.

The slow path is CPU unit test -> GPU frontend tests, so kicking off the GPU frontend tests faster should help decrease CI runtime.

Co-authored-by: driazati <driazati@users.noreply.github.com>
ylc pushed a commit to ylc/tvm that referenced this pull request Feb 16, 2022
This removes the barrier wait between test and integration tests in CI. This will increase capacity requirements and usage but, assuming we can meet that with autoscaler, should reduce CI times by an hour or two since we're doing all the testing in parallel.

The slow path is CPU unit test -> GPU frontend tests, so kicking off the GPU frontend tests faster should help decrease CI runtime.

Co-authored-by: driazati <driazati@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants