
Improve Build Times #621

Closed
cgardens opened this issue Oct 19, 2020 · 8 comments
Labels
build team/prod-eng type/enhancement New feature or request type/refactoring Focused on refactoring.

Comments

@cgardens
Contributor

cgardens commented Oct 19, 2020

Tell us about the problem you're trying to solve

  • The build grows longer. Our patience grows short.

Describe the solution you’d like

  • Get the build back under 15 min (at least). Preferably under 10 min.


@cgardens cgardens added type/enhancement New feature or request type/refactoring Focused on refactoring. labels Oct 19, 2020
@cgardens cgardens added this to the v0.3.0 milestone Oct 19, 2020
@sherifnada
Contributor

sherifnada commented Oct 21, 2020

Context dump from a one-day investigation.

At a high level there are a few major improvements we can make to improve the build:

  1. Provide more resources to the build node. This should be effective: anecdotally, a clean build on my 6-core, 32GB i7 laptop takes ~7 minutes, compared to the ~17 minutes it takes in GitHub Actions CI. More compute would also let us take advantage of further optimizations like better parallelization (e.g. not waiting on unit tests before running integration tests or building dependent Docker images).
  2. Cache Gradle task output for re-use across builds. With this change, a CI run would rebuild from scratch only the tasks whose inputs have changed, and load the output of unchanged tasks from a cross-build cache.
  3. Cache intermediate Docker layers across builds, i.e. retain the Docker cache between CI runs. The benefits here overlap with point 2: a docker-image-producing task whose inputs have not changed would not invoke Docker at all and so would gain nothing from this optimization, but tasks whose inputs have changed would benefit. The impact of this optimization is greater if we do not cache Gradle task output.
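Option 2 maps to Gradle's built-in build cache. A minimal sketch of what enabling it could look like in settings.gradle; the cache node URL and the credential environment variable names here are placeholders, not existing infrastructure:

```groovy
// settings.gradle — hedged sketch of enabling Gradle's local + remote build cache.
// The remote URL and env var names are hypothetical placeholders.
buildCache {
    local {
        enabled = true
    }
    remote(HttpBuildCache) {
        url = 'https://gradle-cache.example.com/cache/'
        // Only CI populates the shared cache; local builds read from it.
        push = System.getenv('CI') != null
        credentials {
            username = System.getenv('GRADLE_CACHE_USER')
            password = System.getenv('GRADLE_CACHE_PASSWORD')
        }
    }
}
```

Caching then kicks in when builds run with `--build-cache` or with `org.gradle.caching=true` set in gradle.properties.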

Each of these major optimizations currently faces challenges.

More Compute

Github Actions does not allow increasing resource allocation for workflow runners. Reference.

Custom runners are available, but carry a security risk on public repositories like ours, since forks could run arbitrary code. We could run each CI build in a Docker container on the self-hosted runner, but security questions remain about the safety of those tasks, since CI runs would share caches. There's also the dev cost of self-hosting a runner. This is potentially doable, but it is not low-hanging fruit.

An alternative is to use a managed CI platform like CircleCI that provides customizable resource allocations, plus some of the other optimizations mentioned here (e.g. CircleCI's performance plan supports caching Docker layers out of the box). This would probably require a paid plan; at stock pricing for the performance plan, the CircleCI cost would not be trivial, potentially ~$3–5k per year. Back-of-the-envelope math here.

Docker Layer Caching

Layer caching is not supported by GitHub out of the box. Premade actions exist (like this one), but the latest version fails on our repo with the issue specified here. We could write and maintain custom logic to achieve this, use a CI provider that supports it out of the box, or host our own Docker cache.
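As a sketch of the custom-logic route, a CI step along these lines could approximate layer caching without any persistent Docker state, by pulling the last published image and using it as a cache source. The image name here is a placeholder, and this assumes the registry allows CI to push:

```shell
# Hedged sketch of a CI step (not our actual pipeline):
# reuse layers from the previously published image as the build cache.
docker pull airbyte/source-foo:dev || true   # tolerate a miss on the first build

docker build \
  --cache-from airbyte/source-foo:dev \
  -t airbyte/source-foo:dev .

docker push airbyte/source-foo:dev           # publish layers for the next CI run
```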

Gradle Task Output Caching

Most of our build time is spent building integrations, which consists of three parts: compiling the integration's source code and running unit tests (least time spent), building its Docker image, and running integration tests against the image. Ideally we'd only build an integration if it or its dependencies have changed in the current commit. However, the Gradle Docker plugin we use cannot detect changes to a Docker image's dependencies (e.g. a parent image specified with a FROM) unless its version has changed. Since all our integration images depend on base images produced from our monorepo with the dev tag, the plugin cannot detect such changes, so we always rebuild the Docker image, as specified in this comment. This means Gradle cannot incrementally build any of our integrations and must build their images and run integration tests every time we run their build.

We could patch around these shortcomings by manually specifying in each integration's build.gradle file (or in a shared helper) which upstream Docker-image-producing modules it depends on, but ultimately this is error-prone and could result in incorrect builds.
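One hedged sketch of such a patch (not the plugin's actual API, and all names here are illustrative): declare the resolved ID of the parent image as an input property of the image-building task, so Gradle re-runs the task whenever the locally built base image changes.

```groovy
// Hypothetical build.gradle fragment: make an image-building task sensitive
// to its base image by treating the base image's ID as a task input.
// Task, image, and module names are placeholders.
task buildSourceImage(type: Exec) {
    // Lazily resolved at execution planning time; if the base image's ID
    // differs from the last build, the task is considered out of date.
    inputs.property('baseImageId') {
        'docker images -q airbyte/integration-base:dev'.execute().text.trim()
    }
    inputs.dir('src')
    commandLine 'docker', 'build', '-t', 'airbyte/source-foo:dev', '.'
}
```

This is exactly the kind of manual wiring the paragraph above warns about: if a module forgets to declare its parent image, the build is silently stale.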

@cgardens
Contributor Author

cgardens commented Oct 23, 2020

Splllllllitting up the Build

Looking at the big picture here, if our goal is to add N integrations and N is a non-trivial number, having a monobuild is not going to work regardless of how much tuning we do or how much firepower we throw at it. Even when N gets to 15, that's just going to hurt, and our goal is to get to much more than 15 really quickly. Perhaps we should focus on figuring out how we are going to split up our build pipeline. My initial naive suggestion would be the following:

  • Primary Build (Runs on all PR commits):
    • Compile all platform source code. Run both integration and unit tests for platform source code.
    • Compile all integration source code. (For now this should be okay but when we hit enough integrations, even this might have to be moved out).
    • Run all tests (unit, integration, standard tests) for top 8 integrations
      • This number is chosen somewhat arbitrarily. This list should attempt to hold some of the most important integrations and / or integrations that rely on special cases in the platform.
    • Run acceptance tests.
  • Secondary Build (Runs only on merges to master):
    • Build and run tests on everything.
    • Alternatively we can have a build per integration that is triggered by pushes to master. Though for the first iteration, just keeping it as a single large build is probably fine and easier.
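Sketched as GitHub Actions workflows, the split might look like the following; workflow names, Gradle task names, and the choice of integrations are all placeholders, not real tasks in the repo:

```yaml
# Hypothetical .github/workflows/pr-build.yml — primary build, every PR commit
name: Primary Build
on: pull_request
jobs:
  primary:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: ./gradlew :airbyte-server:build          # placeholder: platform code + tests
      - run: ./gradlew compileAllIntegrations          # placeholder: compile-only for integrations
      - run: ./gradlew topIntegrationsTest             # placeholder: full tests for the top ~8
      - run: ./gradlew acceptanceTest                  # placeholder: acceptance tests
---
# Hypothetical .github/workflows/master-build.yml — secondary build, merges only
name: Secondary Build
on:
  push:
    branches: [master]
jobs:
  full:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: ./gradlew build integrationTest           # build and test everything
```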

Tradeoffs

  • This should be simple to do quickly.
  • We started running the acceptance tests and integration tests in PR builds because we kept pushing code that failed them and not finding out until it hit master. I think we can get the best of both worlds here by including a handful of integrations in PR builds but not all of them.
  • Yes, we will still push changes that break integrations and will not realize it until we merge to master. But hopefully (if we choose the integrations we run as the primary build well) this should be rare and should be cases that only break one integration as opposed to breaking many.
    • Ideally, someone developing a specific integration should run its tests themselves before merging (if it is not part of the primary build). This won't always happen, but we'll eat that cost and fix quickly after merges to master.
  • We can still iterate quickly.
  • We still get a full correctness check in the build.
  • If we don't do something soon, we're going to start really hurting ourselves as we add new integrations.

@cgardens
Contributor Author

cgardens commented Oct 26, 2020

@jrhizor and I discussed what we'd like to see out of our build system in the long term.

Ideal State (Long Term)

  • Goal: no-op build <30s
  • Incremental
    • All prior state/artifacts cached
      • Pre-requisites
        • Understand Gradle caching and inputs. Right now we don’t understand this well enough.
      • Generators
        • Inputs for python protocol generation
      • Docker
        • Changes in the parent image’s hash
        • Input files (already specified)
      • Formatting
        • Does it work for spotless?
        • Definitely doesn’t work for black
      • webapp?
      • Python module caching in builds #991
  • Reproducible Builds
    • we should try them out
  • Parallelizable
    • Migrate off of Github Actions or pack actions into a single gradlew task.
  • Beefier Compute ???
    • Migrate off of Github Actions
  • Daily build without caching
  • Trigger from Github comments
  • Split Integration Tests

I'm going to try to do the split integration tests piece now since I think it's super easy. We will look at fixing the incrementality stuff next week.

@cgardens
Contributor Author

@jrhizor - delivered on this with incremental builds. (though still need to do python incremental builds).
@cgardens - implemented doing only some tests in PR builds to speed up iteration.

In the future we should split these out into their own tasks so we can close them as we go.

@cgardens
Contributor Author

cgardens commented Nov 16, 2020

GitHub Actions usage / parallelization limits: https://docs.github.com/en/free-pro-team@latest/actions/reference/usage-limits-billing-and-administration

TestContainers' implementation of parallel jobs in GitHub Actions with Gradle: https://rnorth.org/faster-parallel-github-builds/
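In the spirit of that post, fanning integration tests out across a job matrix might look like the sketch below; the module names are placeholders, not a real list from the repo:

```yaml
# Hypothetical workflow fragment: one parallel job per integration module.
jobs:
  integration-tests:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false          # let other modules finish even if one fails
      matrix:
        module: [source-postgres, source-stripe, destination-bigquery]
    steps:
      - uses: actions/checkout@v2
      - run: ./gradlew :airbyte-integrations:${{ matrix.module }}:integrationTest
```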

@davinchia
Contributor

davinchia commented Jul 5, 2022

Our builds already run on custom runners today as of #3019.

Work to split out builds was done here:

Run time of a typical build is 20 mins.

@davinchia
Contributor

@cgardens @sherifnada since most of the initial suggestions in the issue have been implemented, I'm thinking we can close this issue and open more targeted follow-up issues as needed. Thoughts?

@cgardens
Contributor Author

agreed! feel free to close once you feel any remaining relevant information has been captured in the right spots.
