
CI Images are now pre-built and stored in registry #10368

Merged

Conversation

potiuk
Member

@potiuk potiuk commented Aug 17, 2020

With this change we utilise the latest pull_request_target
event type from GitHub Actions and build the
CI image only once (per version) for the entire run.

This saves 2 to 10 minutes per job (!) depending on
how much of the Docker image needs to be rebuilt.

It works so that the image is built only in the
build-or-wait step. In case of direct push or
scheduled runs, the build-or-wait step builds the CI
image and pushes it to the GitHub registry. In case of
pull_request runs, the build-or-wait step waits until
the separate build-ci-image.yml workflow builds and pushes
the image, and it only moves forward once the image
is ready.
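
A minimal sketch of the waiting side of that step, assuming a hypothetical image name, tag layout and polling interval (the real scripts/ci implementation differs in detail):

```bash
#!/usr/bin/env bash
# Sketch only - image name, tag and timings are assumptions, not the
# actual scripts/ci implementation.
set -euo pipefail

IMAGE="docker.pkg.github.com/apache/airflow/master-python3.6-ci:${GITHUB_RUN_ID}"

# In pull_request runs, poll the registry until the separate
# build-ci-image.yml workflow has pushed the image for this run.
until docker pull "${IMAGE}" >/dev/null 2>&1; do
    echo "Waiting for ${IMAGE} to appear in the registry ..."
    sleep 30
done
echo "${IMAGE} is ready - moving forward."
```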

This has numerous advantages:

  1. Each job that requires the CI image is much faster because
    instead of pulling + rebuilding the image it only pulls
    the image that was built once. This saves around 2 minutes
    per job in regular builds, but in case of python patch-level
    updates, or adding new requirements, it can save up to 10
    minutes per job (!)

  2. While the images are being rebuilt we only block one job waiting
    for all the images. The tests start running in parallel
    only when all images are ready, so we are not blocking
    other runs from running.

  3. The whole run uses THE SAME image. Previously we could have some
    variation because the images were built at different times,
    and releases of dependencies in-between several
    jobs could make different jobs in the same run use slightly
    different images. This no longer happens.

  4. Also, when we push the image to GitHub or DockerHub we push the
    very same image that was built and tested. Previously it could
    happen that the pushed image was slightly different from the
    one that was used for testing (for the same reason).

  5. The same applies to the production images. We now consistently
    build and push the same images across the board.

  6. Documentation building is split into two parallel jobs - docs
    building and spell checking - which decreases the elapsed time for
    the docs build.

  7. Last but not least - we keep the history of all the images

    • those images contain the SHA of the commit. This means
      that we can simply download and run the image locally to reproduce
      any problem that anyone had in their PR (!). This is super useful
      for helping others debug their problems (see the sketch after this list).
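
As a hypothetical illustration (the registry path and tag format below are assumptions, not the exact layout used by the workflows), reproducing a PR problem locally could look like this:

```bash
# Illustrative sketch only - registry path and tag naming are assumptions.
COMMIT_SHA="<SHA_FROM_THE_PR_RUN>"
IMAGE="docker.pkg.github.com/apache/airflow/master-python3.6-ci:${COMMIT_SHA}"

docker pull "${IMAGE}"                 # download the exact image the PR used
docker run -it "${IMAGE}" /bin/bash    # poke around in it locally
```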


@potiuk
Member Author

potiuk commented Aug 18, 2020

Hello everyone. This is a major overhaul of the way we utilize GitHub Actions - something that was enabled by recently released GitHub Actions features (namely the "workflow_run" event).

I've been working on it for the last week and heavily tested it on my fork https://github.com/potiuk/airflow/actions. I hope there will be rather few teething problems, but I am already very familiar with how GitHub Actions works and I will be able to fix any problems quickly. I am also going to watch it once we merge, to make sure it works as expected.

See the commit description for what this change achieves. I have just one thing to say - this is my "dream" architecture for the CI builds, the one I had in mind at the very beginning of my work on Airflow, and one that could only be achieved with the most recent changes from GitHub. I really hope this is one of the last fundamental changes in the scripting for CI, because I have literally run out of ideas for what can be improved (just kidding - there are always small things ;).

It has many nice properties, but the most important ones are:

  • 5-12 minutes saved for each job (images are built only once, not for each job). Not per whole run - but per job (!). This will both increase the number of parallel PRs that can run and decrease the feedback time for each build. Builds were sometimes much slower when the python base image was upgraded or the Dockerfile changed - this problem is now gone.

  • the jobs/runs are fully consistent - all jobs in the same build use exactly the same image, prepared only once.

  • full traceability and reproducibility of each run - we keep the images in the GitHub registry and you can recreate the exact failed run by running ./breeze --github-image-id <RUN_ID>, or ./breeze --github-image-id <COMMIT_ID> for merged runs (see the usage example after this list).

  • I cleaned up the job outputs so that they only show relevant information

  • I cleaned up the initialization code for the bash scripts - removed some duplicates, organized it better and fully documented it, describing the purpose of all options (that was the last script refactoring I had planned)
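
The reproducibility commands quoted above, as they would be used (placeholders left as in the comment):

```bash
# Recreate the environment of a failed PR run (RUN_ID comes from the
# GitHub Actions run URL):
./breeze --github-image-id <RUN_ID>

# Recreate the environment of a merged run, by commit:
./breeze --github-image-id <COMMIT_ID>
```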

It's quite a huge change, and I can try to split it into smaller ones (but conceptually it is one big overhaul of the way our CI works).

When you review, you can start from the workflows at the end of the documentation: https://github.com/PolideaInternal/airflow/blob/prebuild-ci-images-in-github-actions/CI.rst - I prepared some sequence diagrams of the CI architecture (using mermaid, which is an absolutely cool tool for converting markdownish descriptions of diagrams into really nice diagrams). It explains all the "whys" and also the "hows".

NOTE! For review/merge I needed to disable waiting for images, so the speedups are not visible yet - I have to merge this to master in order to enable the "Build Image" workflows. I also use the "master" version of my own GitHub Cancel Action, which I developed for this purpose - I will release its v2 and switch to it once we get a few days of the builds working in Airflow.

I developed a new GitHub Action - https://github.com/potiuk/cancel-workflow-runs - for cancelling workflow runs. It is a swiss-army-knife of run cancelling, and I plan to share it with Apache Beam and other Apache projects that might need it as well.

I really look forward to review comments and to merging this eventually. It will help all contributors and committers move faster. This is literally the completion of 2 years of the "dream" architecture for our CI :).

@potiuk
Member Author

potiuk commented Aug 18, 2020

CC: @ad-m

@potiuk potiuk force-pushed the prebuild-ci-images-in-github-actions branch from f2a9ce3 to 14460cd on August 18, 2020 00:48
@mik-laj
Member

mik-laj commented Aug 18, 2020

This change is huge. How can we reduce its size? I can see a few changes that we can extract and do as separate changes:

  • merging the --github-organisation and --github-repo flags into --github-repository
  • building documentation and spell checking separately
  • renaming the ci_*.sh scripts
  • pylint fixes in docs/build_docs.py
  • renaming variables in scripts/ci/in_container/run_prepare_backport_readme.sh and scripts/ci/in_container/run_test_package_import_all_classes.sh
  • less verbose output from helm and kubectl - scripts/ci/libraries/_verbosity.sh (see the sketch after this list)
  • pylint fixes in scripts/tools/list-integrations.py
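
For the _verbosity.sh item, the intent is roughly this kind of wrapper (a sketch with assumed names - the real script may differ):

```bash
#!/usr/bin/env bash
# Sketch only - the function name is illustrative, not necessarily the one
# used in scripts/ci/libraries/_verbosity.sh.
function run_quietly() {
    if [[ ${VERBOSE:=false} == "true" ]]; then
        "${@}"                       # verbose mode: stream all output
    else
        local output
        if ! output=$("${@}" 2>&1); then
            echo "${output}" >&2     # quiet mode: show output only on failure
            return 1
        fi
    fi
}

run_quietly helm version
run_quietly kubectl version --client
```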

This may seem like a lot of extra work, but I think it's the only way we can be sure about this change and avoid a lot of teething problems. Right now it is very painful for me to review this change, because I have to remember every change and think about the consequences of many changes at once.

If you like, I will be happy to help you break down this change to make it easier to review.

@potiuk
Member Author

potiuk commented Aug 18, 2020

This change is huge. How can we reduce its size? I can see a few changes that we can extract and do as separate changes

Yep. I am happy to do it. I will try to separate out some of the changes and keep rebasing this one until only the "gist" of the change remains as the last commit.

@potiuk potiuk force-pushed the prebuild-ci-images-in-github-actions branch from d21e67d to 7e8e0cc on August 18, 2020 10:49
potiuk added a commit to PolideaInternal/airflow that referenced this pull request Aug 18, 2020
potiuk added a commit that referenced this pull request Aug 18, 2020
potiuk added a commit to PolideaInternal/airflow that referenced this pull request Aug 18, 2020
This cleans up the document building process and replaces it
with a breeze-only one. The original instructions with
`pip install -e .[doc]` stopped working, so there is no
point keeping them.

Extracted from apache#10368
@potiuk potiuk added this to the Airflow 1.10.13 milestone Nov 14, 2020
@potiuk potiuk added the type:misc/internal Changelog: Misc changes that should appear in change log label Nov 14, 2020
potiuk added a commit that referenced this pull request Nov 16, 2020
We've observed the tests for the last couple of weeks and it seems
most of the tests marked with the "quarantine" marker are succeeding
in a stable way (#10118).
The removed tests have a success ratio of > 95% (20 runs without
problems), and this was also verified a week ago,
so it seems they are rather stable.

There are literally a few that are either failing or causing
the Quarantined builds to hang. I manually reviewed the
master tests that failed over the last few weeks and added the
tests that are causing the build to hang.

It seems that stability has improved - which might be caused
by some temporary problems when we marked the quarantined builds,
or a too "generous" way of marking tests as quarantined, or
maybe the improvement comes from #10368, as the docker engine
and machines used to run the builds in GitHub experience far
less load (image builds are executed in separate builds), so
it might be that resource usage is decreased. Another reason
might be GitHub Actions stability improvements.

Or simply those tests are more stable when run in isolation.

We might still add failing tests back as soon as we see them behave
in a flaky way.

The remaining quarantined tests that need to be fixed:
 * test_local_run (often hangs the build)
 * test_retry_handling_job
 * test_clear_multiple_external_task_marker
 * test_should_force_kill_process
 * test_change_state_for_tis_without_dagrun
 * test_cli_webserver_background

We also move some of those tests to the "heisentests" category.
Those tests run fine in isolation but fail
the builds when run with all the other tests:
 * TestImpersonation tests

We might find that those heisentests can be fixed, but for
now we are going to run them in isolation.

Also - since those quarantined tests are failing more often,
the "num runs" to track for them has been decreased to 10,
to keep track of the 10 last runs only.

(cherry picked from commit b746f33)
potiuk added a commit that referenced this pull request Nov 16, 2020
Part of #10368

(cherry picked from commit a32e90a)
potiuk added a commit that referenced this pull request Nov 16, 2020
potiuk added a commit that referenced this pull request Nov 16, 2020
The output of pre-commit builds, both on CI and locally,
is now limited to only show errors, unless the verbose
variable is set.

We utilise aliases where possible, but pre-commits
run in a non-interactive shell, which
means that aliases do not work as expected, so we have
to run a few functions directly in order to
show the spinner.

Extracted from #10368

(cherry picked from commit 77a635e)
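
A tiny sketch of the alias limitation described above (names are illustrative):

```bash
#!/usr/bin/env bash
# In a non-interactive shell - such as the one pre-commit hooks run in -
# aliases are not expanded by default:
alias spinner_alias='echo "spinner (alias)"'
# spinner_alias                # would fail here with "command not found"

# Functions work in any shell context, so we call them directly instead:
function show_spinner() {
    echo "spinner (function)"
}
show_spinner
```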
potiuk added a commit that referenced this pull request Nov 16, 2020
This allows for all the kinds of verbosity we want, including
writing outputs to output files, and it also works out-of-the-box
in non-interactive shell scripts such as git commit hooks. As a side effect
we now have mocked tools in bats tests, which will allow us to write
more comprehensive unit tests for our bash scripts
(a long overdue task).

Part of #10368

(cherry picked from commit db446f2)
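
A sketch of what such a bats test with a mocked tool could look like (the mocked command and assertions are assumptions, not the actual tests):

```bash
#!/usr/bin/env bats
# Sketch only - the mocked tool and assertions are illustrative.

setup() {
    MOCK_BIN="$(mktemp -d)"
    export MOCK_LOG="${MOCK_BIN}/calls.log"
    # Mock "docker" so the code under test runs without a real daemon;
    # the mock just records how it was called.
    cat >"${MOCK_BIN}/docker" <<'EOF'
#!/usr/bin/env bash
echo "docker ${*}" >> "${MOCK_LOG}"
EOF
    chmod +x "${MOCK_BIN}/docker"
    export PATH="${MOCK_BIN}:${PATH}"
}

@test "the script pulls the CI image" {
    run docker pull some/image:latest
    [ "${status}" -eq 0 ]
    grep -q "pull some/image:latest" "${MOCK_LOG}"
}
```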
potiuk added a commit that referenced this pull request Nov 16, 2020
potiuk added a commit that referenced this pull request Nov 16, 2020
potiuk added a commit that referenced this pull request Nov 16, 2020
potiuk added a commit that referenced this pull request Nov 16, 2020
Follow up after #10368

(cherry picked from commit 2c3ce8e)
potiuk added a commit that referenced this pull request Nov 16, 2020
Breeze failed after #10368

(cherry picked from commit dc27a2a)
potiuk added a commit that referenced this pull request Nov 16, 2020
Follow up after #10368

(cherry picked from commit c35a010)
potiuk added a commit that referenced this pull request Nov 16, 2020
After #10368, we've changed the way we build the images
on CI. We override the ci scripts that we use
to build the image with the scripts taken from master,
so as not to give rogue PR authors the possibility to run
something with the write credentials.

We should not override the in_container scripts, however,
because they become part of the image, so we should use
the ones that came with the PR. That's why we have to move
the "in_container" scripts out of the "ci" folder and
only override the "ci" folder with the one from
master. We've made sure that the scripts in ci
are self-contained and do not need to reach outside of
that folder.

Also, the static checks are done with local files mounted
on CI, because we want to check all the files - not only
those that are embedded in the container.

(cherry picked from commit 1cf1af6)
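
One way to picture that override (a sketch only - the actual workflow steps differ):

```bash
# Sketch only. After checking out the PR code, replace the potentially
# untrusted "ci" scripts with the trusted copy from master, while keeping
# the PR's own in_container scripts (those are baked into the image anyway).
git fetch origin master
rm -rf scripts/ci
git checkout origin/master -- scripts/ci
```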
potiuk added a commit that referenced this pull request Nov 16, 2020
When we pull the image from the base registry we should also push
it to the GitHub Registry so that we can track the
origin of the run.

Note that this merely pushes the tag - the image will be
pulled directly from the registry, so this only updates
the tag in the registry.

Bug introduced in #10368

(cherry picked from commit c979de7)
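
In plain docker commands, the tag update is roughly the following (image names and registry paths are illustrative assumptions):

```bash
# Sketch only - image names and registry paths are assumptions.
docker pull apache/airflow:master-python3.6-ci
docker tag apache/airflow:master-python3.6-ci \
    docker.pkg.github.com/apache/airflow/master-python3.6-ci:latest
docker push docker.pkg.github.com/apache/airflow/master-python3.6-ci:latest
```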
potiuk added a commit that referenced this pull request Nov 16, 2020
kaxil pushed a commit that referenced this pull request Nov 18, 2020
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
Labels
area:dev-tools type:misc/internal Changelog: Misc changes that should appear in change log