CI Images are now pre-built and stored in registry #10368
Conversation
Hello everyone. This is a major overhaul of the way we are utilizing GitHub Actions - something that was enabled by recently released GitHub Actions features (namely the "workflow_run" event). I've been working on it for the last week and heavily tested it on my fork https://github.com/potiuk/airflow/actions. I hope there will be rather few teething problems, but I am already very familiar with how GitHub Actions works and I will be able to fix any problems quickly. I am also going to watch it once we merge it to make sure it works as expected. See the commit description for what is achieved by this change. I have just one thing to say - this is my "dream" architecture of the CI builds that I had in mind at the very beginning of my work on Airflow, one that could only be achieved by the most recent changes from GitHub. I really hope this is one of the last fundamental changes in the scripting for CI because I have literally run out of ideas for what can be improved (just kidding - there are always small things ;). It has many nice properties - the most important ones are described in the commit description below.
It's quite a huge change, and I can try to split it into smaller ones (but conceptually it is one big overhaul of the way our CI works). You can start from the workflows described at the end of the documentation: https://github.com/PolideaInternal/airflow/blob/prebuild-ci-images-in-github-actions/CI.rst - I prepared some sequence diagrams of the CI architecture (using mermaid - which is an absolutely cool tool for converting markdownish descriptions of diagrams into really nice diagrams). It explains all the "whys" and also the "hows". NOTE! For review/merge I needed to disable waiting for images, so the speedups are not visible yet - I have to merge it to master in order to enable the "Build Image" workflows. I also use the "master" version of my own GitHub Cancel Action which I developed for that purpose - I will release its v2 version and switch to it once we get a few days of the builds working in Airflow. I developed a new GitHub Action, https://github.com/potiuk/cancel-workflow-runs, for "Cancel Workflow Runs" - a Swiss-army knife of run cancelling - and I plan to share it with Apache Beam and other Apache projects that might need it as well. I really look forward to review comments and merging it eventually. This will help all contributors and committers to move faster. This is literally the completion of 2 years of the "dream" architecture for our CI :)
CC: @ad-m
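To make the "workflow_run" mechanism mentioned above more concrete, here is a minimal sketch of a separate workflow that reacts to a PR-triggered run - the workflow, job and script names are assumptions for illustration, not the actual Airflow files:

```yaml
# Minimal sketch (assumed names): a separate "Build CI Image" workflow that is
# triggered whenever the PR-triggered "CI Build" workflow is requested, so the
# image can be built once with the repository's own credentials.
name: Build CI Image
on:
  workflow_run:
    workflows: ["CI Build"]      # name of the PR-triggered workflow - an assumption
    types: [requested]

jobs:
  build-ci-image:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Build and push the CI image to the GitHub registry
        # Illustrative script name - the real build logic lives in Airflow's ci scripts.
        run: ./scripts/ci/build_and_push_ci_image.sh
```

Because a workflow_run-triggered workflow runs in the context of the base repository's default branch, it has access to write credentials even for fork PRs - which is what makes pre-building and pushing the image possible in the first place.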
Force-pushed from f2a9ce3 to 14460cd
This change is huge. What can we do to reduce its size? I can see a few changes that we can extract and do as separate changes:
This may seem like a lot of extra work, but I think it's the only way we can be sure about this change and not get a lot of teething problems. Right now it is very painful for me to review this change because I have to remember every change and think about the consequences of many changes at once. If you like, I will be happy to help you break down this change to make it easier to review.
Yep. I am happy to do it. I will try to separate out some of the changes and keep rebasing this one until the "gist" of the change remains as the last commit.
Force-pushed from d21e67d to 7e8e0cc
Extracted from apache#10368
This cleans up the document building process and replaces it with a Breeze-only one. The original instructions with `pip install -e .[doc]` stopped working, so there is no point keeping them. Extracted from apache#10368
We've observed the tests for the last couple of weeks and it seems most of the tests marked with the "quarantine" marker are succeeding in a stable way (#10118). The removed tests have a success ratio of > 95% (20 runs without problems), and this was verified a week ago as well, so it seems they are rather stable. There are literally a few that are either failing or causing the Quarantined builds to hang. I manually reviewed the master tests that failed over the last few weeks and added the tests that are causing the build to hang. It seems that stability has improved - which might be caused by some temporary problems present when we marked the quarantined builds, or a too "generous" way of marking tests as quarantined, or maybe the improvement comes from #10368, as the docker engine and machines used to run the builds in GitHub experience far less load (image builds are executed in separate builds), so it might be that resource usage has decreased. Another reason might be GitHub Actions stability improvements. Or simply those tests are more stable when run in isolation. We might still add failing tests back as soon as we see them behave in a flaky way. The remaining quarantined tests that need to be fixed:
* test_local_run (often hangs the build)
* test_retry_handling_job
* test_clear_multiple_external_task_marker
* test_should_force_kill_process
* test_change_state_for_tis_without_dagrun
* test_cli_webserver_background
We also move some of those tests to the "heisentests" category. Those tests run fine in isolation but fail the builds when run with all other tests:
* TestImpersonation tests
We might find that those heisentests can be fixed, but for now we are going to run them in isolation. Also - since those quarantined tests are failing more often, the "num runs" to track for those has been decreased to 10 to keep track of the 10 last runs only. (cherry picked from commit b746f33)
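As a rough illustration of isolating quarantined tests from the main suite, here is a hedged sketch of two workflow jobs - the marker name, paths and commands are assumptions, not the actual Airflow configuration:

```yaml
# Sketch only (assumed marker name and paths): run the main suite without the
# quarantined tests, and run the quarantined ones in a separate, isolated job.
jobs:
  tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run the main test suite, excluding quarantined tests
        run: pytest -m "not quarantined" tests/
  quarantined-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run quarantined tests in isolation
        run: pytest -m "quarantined" tests/
```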
The output of pre-commit builds, both on CI and locally, is now limited to only show errors, unless the verbose variable is set. We are utilising aliases if possible, but pre-commits are run in a non-interactive shell, which means that aliases do not work as expected, so we have to run a few functions directly in order to show the spinner. Extracted from #10368 (cherry picked from commit 77a635e)
This allows for all the kinds of verbosity we want, including writing outputs to output files, and it also works out-of-the-box in non-interactive git-commit shell scripts. Also, as a side effect, we have mocked tools in bats tests, which will allow us to write more comprehensive unit tests for our bash scripts (this is a long overdue task). Part of #10368 (cherry picked from commit db446f2)
* CI Images are now pre-built and stored in registry
With this change we utilise the latest pull_request_target event type from GitHub Actions and we build the CI image only once (per version) for the entire run. This saves from 2 to 10 minutes per job (!) depending on how much of the Docker image needs to be rebuilt. The image is built only in the build-or-wait step. In the case of direct push or scheduled runs, the build-or-wait step builds the CI image and pushes it to the GitHub registry. In the case of pull_request runs, the build-or-wait step waits until the separate build-ci-image.yml workflow builds and pushes the image, and it only moves forward once the image is ready. This has numerous advantages:
1) Each job that requires the CI image is much faster because instead of pulling + rebuilding the image it only pulls the image that was built once. This saves around 2 minutes per job in regular builds, but in the case of Python patch-level updates or adding new requirements it can save up to 10 minutes per job (!)
2) While the images are being rebuilt we only block one job waiting for all the images. The tests will start running in parallel only when all images are ready, so we are not blocking other runs from running.
3) The whole run uses THE SAME image. Previously we could have some variations because the images were built at different times, and releases of dependencies in between several jobs could potentially make different jobs in the same run use slightly different images. This is not happening any more.
4) Also, when we push the image to GitHub or DockerHub we push the very same image that was built and tested. Previously it could happen that the pushed image was slightly different than the one used for testing (for the same reason).
5) A similar case is with the production images. We are now consistently building and pushing the same images across the board.
6) Documentation building is split into two parallel jobs - docs building and spell checking - which decreases elapsed time for the docs build.
7) Last but not least - we keep the history of all the images - those images contain the SHA of the commit. This means that we can simply download and run the image locally to reproduce any problem that anyone had in their PR (!). This is super useful to be able to help others to test their problems.
* fixup! CI Images are now pre-built and stored in registry
* fixup! fixup! CI Images are now pre-built and stored in registry
* fixup! fixup! fixup! CI Images are now pre-built and stored in registry
* fixup! fixup! fixup! CI Images are now pre-built and stored in registry
(cherry picked from commit de7500d)
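A hedged sketch of the "build-or-wait" idea described above - the event names come from GitHub Actions, but the registry path, image name and script are illustrative assumptions:

```yaml
# Sketch (assumed names/paths): build and push the image for push/scheduled
# runs; for pull_request runs, wait for the separately built image to appear.
jobs:
  build-or-wait-for-ci-image:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Build and push the CI image (push and scheduled runs)
        if: github.event_name == 'push' || github.event_name == 'schedule'
        run: ./scripts/ci/build_and_push_ci_image.sh   # illustrative script name
      - name: Wait for the image built by the separate workflow (PR runs)
        if: github.event_name == 'pull_request'
        run: |
          # Poll the registry until the image tagged with this commit's SHA exists.
          # (docker login to the registry is omitted in this sketch.)
          until docker pull "docker.pkg.github.com/${GITHUB_REPOSITORY}/ci-image:${GITHUB_SHA}"; do
            echo "Image not ready yet - retrying in 30 seconds..."
            sleep 30
          done
```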
After #10368, we've changed the way we build the images on CI. We are overriding the ci scripts that we use to build the image with the scripts taken from master, so as not to give rogue PR authors the possibility to run something with the write credentials. We should not override the in_container scripts, however, because they become part of the image, so we should use those that came with the PR. That's why we have to move the "in_container" scripts out of the "ci" folder and only override the "ci" folder with the one from master. We've made sure that those scripts in ci are self-contained and do not need to reach outside of that folder. Also, the static checks are done with local files mounted on CI because we want to check all the files - not only those that are embedded in the container. (cherry picked from commit 1cf1af6)
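A minimal sketch of what "overriding the ci folder with the one from master" could look like inside a workflow - job name and path are assumptions, and the exact mechanism used by Airflow may differ:

```yaml
jobs:
  static-checks:            # job name is illustrative
    runs-on: ubuntu-latest
    steps:
      - name: Checkout the PR code
        uses: actions/checkout@v2
      - name: Override the "ci" scripts with the trusted ones from master
        # Rogue PR changes to these scripts then cannot run with write credentials;
        # the in_container scripts are left as they came with the PR.
        run: |
          git fetch origin master
          git checkout origin/master -- scripts/ci   # path assumed for illustration
```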
When we pull the image from the base registry we should also push it to the GitHub Registry so that we can track the origin of the run. Note that it merely pushes the tag - the image will be directly pulled from the registry, so this will only update the tag in the registry. Bug introduced in #10368 (cherry picked from commit c979de7)
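A hedged sketch of the fix described above - re-tagging the pulled image and pushing the tag to the GitHub Registry (image names and registry paths are assumptions):

```yaml
      # Sketch: after pulling the image from the base registry, re-tag and push
      # it to the GitHub Registry so the origin of the run can be tracked.
      # Because the layers already exist there, only the tag is updated.
      - name: Push the pulled CI image tag to the GitHub Registry
        run: |
          docker tag "apache/airflow-ci:latest" \
            "docker.pkg.github.com/${GITHUB_REPOSITORY}/ci-image:${GITHUB_SHA}"   # source image name assumed
          docker push "docker.pkg.github.com/${GITHUB_REPOSITORY}/ci-image:${GITHUB_SHA}"
```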
Part of apache#10368 (cherry picked from commit a32e90a)
…#10387) Extracted from apache#10368 (cherry picked from commit 306a666)
Part of apache#10368 (cherry picked from commit 08fe5c4)
Follow up after apache#10368 (cherry picked from commit 2c3ce8e)
Breeze failed after apache#10368 (cherry picked from commit dc27a2a)
Follow up after apache#10368 (cherry picked from commit c35a010)
With this change we utilise the latest pull_request_target event type from GitHub Actions and we build the CI image only once (per version) for the entire run. This saves from 2 to 10 minutes per job (!) depending on how much of the Docker image needs to be rebuilt.

The image is built only in the build-or-wait step. In the case of direct push or scheduled runs, the build-or-wait step builds the CI image and pushes it to the GitHub registry. In the case of pull_request runs, the build-or-wait step waits until the separate build-ci-image.yml workflow builds and pushes the image, and it only moves forward once the image is ready.

This has numerous advantages:

1) Each job that requires the CI image is much faster because instead of pulling + rebuilding the image it only pulls the image that was built once. This saves around 2 minutes per job in regular builds, but in the case of Python patch-level updates or adding new requirements it can save up to 10 minutes per job (!)

2) While the images are being rebuilt we only block one job waiting for all the images. The tests will start running in parallel only when all images are ready, so we are not blocking other runs from running.

3) The whole run uses THE SAME image. Previously we could have some variations because the images were built at different times, and releases of dependencies in between several jobs could potentially make different jobs in the same run use slightly different images. This is not happening any more.

4) Also, when we push the image to GitHub or DockerHub we push the very same image that was built and tested. Previously it could happen that the pushed image was slightly different than the one used for testing (for the same reason).

5) A similar case is with the production images. We are now consistently building and pushing the same images across the board.

6) Documentation building is split into two parallel jobs - docs building and spell checking - which decreases elapsed time for the docs build (see the sketch below).

7) Last but not least - we keep the history of all the images - those images contain the SHA of the commit. This means that we can simply download and run the image locally to reproduce any problem that anyone had in their PR (!). This is super useful to be able to help others to test their problems.
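As an illustration of point 6, a hedged sketch of the split documentation workflow - job names and script paths are assumptions, not the actual Airflow files:

```yaml
# Sketch (assumed names): building the docs and spell checking run as two
# independent jobs, so they execute in parallel and shorten the docs build.
jobs:
  docs-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Build the documentation
        run: ./scripts/ci/ci_docs_build.sh        # illustrative path
  docs-spell-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Spell check the documentation
        run: ./scripts/ci/ci_docs_spellcheck.sh   # illustrative path
```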