Combine unit and integration test steps into one stage #9733
Conversation
Force-pushed from 89b306a to bdf9ba1
This removes the barrier wait between the unit tests and the integration tests in CI. This will increase capacity requirements and usage but, assuming we can meet that with the autoscaler, it should reduce CI times by an hour or two since we're doing all the testing in parallel. The slow path is CPU unit tests -> GPU frontend tests, so kicking off the GPU frontend tests sooner should help decrease CI runtime.
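For readers less familiar with the Jenkinsfile layout, here is a minimal sketch of the before/after shape being described. The stage names, node labels, and script paths below are illustrative placeholders, not the actual TVM Jenkinsfile:

```groovy
// Illustrative scripted-pipeline sketch only -- names and paths are placeholders.
//
// Before: the unit test stage was a barrier, so GPU frontend tests could not
// start until every unit test branch had finished:
//
//   stage('Unit Test')        { parallel(/* unit test branches */) }
//   stage('Integration Test') { parallel(/* integration + frontend branches */) }
//
// After: all test branches live in one parallel block, so each branch starts
// as soon as an executor and its build artifacts are available.
stage('Test') {
  parallel(
    'unittest: GPU': {
      node('GPU') {
        sh './tests/scripts/run_unit_tests.sh'        // placeholder script
      }
    },
    'python3: CPU': {
      node('CPU') {
        sh './tests/scripts/run_integration_tests.sh' // placeholder script
      }
    },
    'frontend: GPU': {
      node('GPU') {
        sh './tests/scripts/run_frontend_tests.sh'    // placeholder script
      }
    }
  )
}
```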
Force-pushed from bdf9ba1 to 5c4a6fa
thanks @driazati !
}
},
'python3: CPU': {
i think right now, failing unit tests wouldn't cancel integration tests. are we concerned with overburdening CI with PRs that fail unit tests? I wonder if we should somehow cancel integration test builds if unit tests fail. we could also just try merging and see if it's a problem, too.
The potential time savings of this are pretty significant so I think it's worth it even if it demands more capacity from CI. I could try to pull some numbers of # of jobs vs # of jobs with failing unit tests vs # of jobs with failing integration tests to back this up
Here are some classifications over the last several failing PR jobs; it seems like unit test failures aren't super frequent compared to the other categories, so this PR is probably fine in terms of the demand increase:
build 262
lint 91
integration 46
infra 22
unit tests 19
unclassified 3
ok, main concern here is about the frontend tests i think. those do take over an hour to run, so that might be a big load increase if they don't get cancelled.
sorry should have included this, but this is how I did the accounting:
WHEN name like '%Build and run C++ tests' THEN 'build'
WHEN name like '%Run cmake build' THEN 'build'
WHEN name like '%Run microTVM tests' THEN 'build'
WHEN name like '%integration tests' THEN 'integration'
WHEN name like '%frontend tests' THEN 'integration'
WHEN name like '%unit tests' THEN 'unit tests'
WHEN name like '%Sphinx warnings in docs' THEN 'lint'
WHEN name like '%Run lint' THEN 'lint'
WHEN name like '%executor node info' THEN 'infra'
WHEN name like '%Check out from version control' THEN 'infra'
WHEN name like '%JUnit-formatted test results' THEN 'infra'
WHEN name like '%Docker image names' THEN 'infra'
WHEN name like '%files previously stashed' THEN 'infra'
WHEN name like '%Rust build and test' THEN 'build'
Isn't the concern the case where unit tests fail and would previously have blocked the frontend (integration) tests from running? If so, the number to look at is unit test failures, since with this PR those no longer gate the frontend tests. Frontend failures mean the build got all the way there anyway, so the capacity requirements for those would be the same.
ok, given the data i support trying this change! we may need to evaluate its impact as folks start using it day-to-day.
}
}
stage('Integration Test') {
however, I'm not sure it's a good idea to run the frontend tests with no way to cancel them when the unit tests haven't passed
let's give this a shot and we can always bring forward the failFast logic from @Mousius now that we are going this route
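For reference, a minimal sketch of what that failFast behavior could look like in a scripted parallel block; the branch names and scripts are placeholders, not the actual logic from @Mousius' change:

```groovy
// With failFast: true, Jenkins aborts the remaining parallel branches as soon
// as any branch fails, e.g. cancelling long GPU frontend tests when a quick
// CPU unit test branch has already failed.
stage('Test') {
  parallel(
    failFast: true,
    'unittest: CPU': { node('CPU') { sh './run_unit_tests.sh' } },      // placeholder script
    'frontend: GPU': { node('GPU') { sh './run_frontend_tests.sh' } }   // placeholder script
  )
}
```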
let's see if there are any additional thoughts from the community
cc @Mousius
hi @driazati / @areusch, I can't see the output in the other Jenkins as I'm not authorised, but based on the Jenkinsfile this seems like a good thing to just try 😸 CI already gets super bogged down with many builds, so I don't think there'll be a real difference to that situation by implementing this, but the happy path for when CI has capacity will be vastly improved, potentially leading to fewer builds piling up anyway. The later stages will be using more GPU instances, which are currently never used unless we get that far in the pipeline, so this should be fairly safe, and there's always the revert button if it really performs poorly. I would definitely advocate for #9129 as well to maximize free capacity, but that can be a future addition once we've seen how this works in practice.

P.S. It'd be nice to see these all as GitHub checks in parallel as well 😸
agreed, that'd be nice but I don't think Jenkins gives it out of the box; maybe something like [this plugin](https://plugins.jenkins.io/github-autostatus/), or we could manually report statuses
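As a rough illustration of the "manually report statuses" option, a pipeline helper could post one GitHub commit status per parallel branch so each shows up as its own check. The helper name, credential ID, and wiring below are assumptions for the sketch, not existing TVM CI code:

```groovy
// Hypothetical helper: post a per-branch commit status via GitHub's
// commit-status API so each parallel branch appears as its own check.
def reportGithubStatus(String context, String state) {
  // 'github-token' is a placeholder credential ID.
  withCredentials([string(credentialsId: 'github-token', variable: 'GITHUB_TOKEN')]) {
    sh """
      curl -s -X POST \\
        -H "Authorization: token \$GITHUB_TOKEN" \\
        -d '{"state": "${state}", "context": "${context}"}' \\
        https://api.github.com/repos/apache/tvm/statuses/${env.GIT_COMMIT}
    """
  }
}

// Example usage inside a branch:
//   reportGithubStatus('unittest: GPU', 'pending')
//   ... run tests ...
//   reportGithubStatus('unittest: GPU', 'success')
```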
ok let's try this. i'll approve it after ci-docker-staging succeeds
Looks like it passed, so I'm going to drop this in.
This removes the barrier wait between the unit tests and the integration tests in CI. This will increase capacity requirements and usage but, assuming we can meet that with the autoscaler, it should reduce CI times by an hour or two since we're doing all the testing in parallel. The slow path is CPU unit tests -> GPU frontend tests, so kicking off the GPU frontend tests sooner should help decrease CI runtime. The CPU build was also running a bunch of unit tests; this breaks them out into their own job so the CPU build step shouldn't block the rest of CI for as long.

Co-authored-by: driazati <driazati@users.noreply.github.com>
@areusch