
CI Images are now pre-built and stored in registry #10368

Merged

Conversation

potiuk
Member

@potiuk potiuk commented Aug 17, 2020

With this change we utilise the latest pull_request_target
event type from GitHub Actions and build the
CI image only once (per version) for the entire run.

This saves 2 to 10 minutes per job (!) depending on
how much of the Docker image needs to be rebuilt.

It works so that the image is built only in the
build-or-wait step. In case of direct push or
scheduled runs, the build-or-wait step builds the CI
image and pushes it to the GitHub registry. In case of
pull_request runs, the build-or-wait step waits until
the separate build-ci-image.yml workflow builds and pushes
the image, and it only moves forward once the image
is ready.
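
A minimal sketch of the waiting side of that step, assuming a hypothetical image name, tag layout and polling interval (the real scripts/ci implementation differs in detail):

```bash
#!/usr/bin/env bash
# Sketch only - image name, tag and timings are assumptions, not the
# actual scripts/ci implementation.
set -euo pipefail

IMAGE="docker.pkg.github.com/apache/airflow/master-python3.6-ci:${GITHUB_RUN_ID}"

# In pull_request runs, poll the registry until the separate
# build-ci-image.yml workflow has pushed the image for this run.
until docker pull "${IMAGE}" >/dev/null 2>&1; do
    echo "Waiting for ${IMAGE} to appear in the registry ..."
    sleep 30
done
echo "${IMAGE} is ready - moving forward."
```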

This has numerous advantages:

  1. Each job that requires the CI image is much faster because
    instead of pulling + rebuilding the image it only pulls
    the image that was built once. This saves around 2 minutes
    per job in regular builds, but in case of python patch-level
    updates, or adding new requirements, it can save up to 10
    minutes per job (!)

  2. While the images are being rebuilt we only block one job waiting
    for all the images. The tests start running in parallel
    only when all images are ready, so we are not blocking
    other runs from running.

  3. The whole run uses THE SAME image. Previously we could have some
    variation because the images were built at different times,
    and releases of dependencies in-between several
    jobs could make different jobs in the same run use slightly
    different images. This no longer happens.

  4. Also, when we push the image to GitHub or DockerHub we push the
    very same image that was built and tested. Previously it could
    happen that the pushed image was slightly different from the
    one that was used for testing (for the same reason).

  5. The same applies to the production images. We now consistently
    build and push the same images across the board.

  6. Documentation building is split into two parallel jobs - docs
    building and spell checking - which decreases the elapsed time for
    the docs build.

  7. Last but not least - we keep the history of all the images

    • those images contain the SHA of the commit. This means
      that we can simply download and run the image locally to reproduce
      any problem that anyone had in their PR (!). This is super useful
      for helping others debug their problems (see the sketch after this list).
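
As a hypothetical illustration (the registry path and tag format below are assumptions, not the exact layout used by the workflows), reproducing a PR problem locally could look like this:

```bash
# Illustrative sketch only - registry path and tag naming are assumptions.
COMMIT_SHA="<SHA_FROM_THE_PR_RUN>"
IMAGE="docker.pkg.github.com/apache/airflow/master-python3.6-ci:${COMMIT_SHA}"

docker pull "${IMAGE}"                 # download the exact image the PR used
docker run -it "${IMAGE}" /bin/bash    # poke around in it locally
```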


@potiuk
Member Author

potiuk commented Aug 18, 2020

Hello everyone. This is a major overhaul of the way we utilize GitHub Actions - something that was enabled by recently released GitHub Actions features (namely the "workflow_run" event).

I've been working on it for the last week and heavily tested it on my fork https://github.com/potiuk/airflow/actions. I hope there will be rather few teething problems, but I am already very familiar with how GitHub Actions works and I will be able to fix any problems quickly. I am also going to watch it once we merge, to make sure it works as expected.

See the commit description for what this change achieves. I have just one thing to say - this is my "dream" architecture for the CI builds, the one I had in mind at the very beginning of my work on Airflow, and one that could only be achieved with the most recent changes from GitHub. I really hope this is one of the last fundamental changes in the scripting for CI, because I have literally run out of ideas for what can be improved (just kidding - there are always small things ;).

It has many nice properties, but the most important ones are:

  • 5-12 minutes saved for each job (images are built only once, not for each job). Not per whole run - but per job (!). This will both increase the number of parallel PRs that can run and decrease the feedback time for each build. Builds were sometimes much slower when the python base image was upgraded or the Dockerfile changed - this problem is now gone.

  • the jobs/runs are fully consistent - all jobs in the same build use exactly the same image, prepared only once.

  • full traceability and reproducibility of each run - we keep the images in the GitHub registry and you can recreate the exact failed run by running ./breeze --github-image-id <RUN_ID>, or ./breeze --github-image-id <COMMIT_ID> for merged runs (see the usage example after this list).

  • I cleaned up the job outputs so that they only show relevant information

  • I cleaned up the initialization code for the bash scripts - removed some duplicates, organized it better and fully documented it, describing the purpose of all options (that was the last script refactoring I had planned)
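
The reproducibility commands quoted above, as they would be used (placeholders left as in the comment):

```bash
# Recreate the environment of a failed PR run (RUN_ID comes from the
# GitHub Actions run URL):
./breeze --github-image-id <RUN_ID>

# Recreate the environment of a merged run, by commit:
./breeze --github-image-id <COMMIT_ID>
```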

It's quite a huge change, and I can try to split it into smaller ones (but conceptually it is one big overhaul of the way our CI works).

When you review, you can start from the workflows at the end of the documentation: https://github.com/PolideaInternal/airflow/blob/prebuild-ci-images-in-github-actions/CI.rst - I prepared some sequence diagrams of the CI architecture (using mermaid, which is an absolutely cool tool for converting markdownish descriptions of diagrams into really nice diagrams). It explains all the "whys" and also the "hows".

NOTE! For review/merge I needed to disable waiting for images, so the speedups are not visible yet - I have to merge this to master in order to enable the "Build Image" workflows. I also use the "master" version of my own GitHub Cancel Action, which I developed for this purpose - I will release its v2 and switch to it once we get a few days of the builds working in Airflow.

I developed a new GitHub Action - https://github.com/potiuk/cancel-workflow-runs - for cancelling workflow runs. It is a swiss-army-knife of run cancelling, and I plan to share it with Apache Beam and other Apache projects that might need it as well.

I really look forward to review comments and to merging this eventually. It will help all contributors and committers move faster. This is literally the completion of 2 years of the "dream" architecture for our CI :).

@potiuk
Member Author

potiuk commented Aug 18, 2020

CC: @ad-m

@potiuk potiuk force-pushed the prebuild-ci-images-in-github-actions branch from f2a9ce3 to 14460cd on August 18, 2020 00:48
@mik-laj
Member

mik-laj commented Aug 18, 2020

This change is huge. How can we reduce its size? I can see a few changes that we can extract and do as separate changes:

  • merging the --github-organisation and --github-repo flags into --github-repository
  • building documentation and spell checking separately
  • renaming the ci_*.sh scripts
  • pylint fixes in docs/build_docs.py
  • renaming variables in scripts/ci/in_container/run_prepare_backport_readme.sh and scripts/ci/in_container/run_test_package_import_all_classes.sh
  • less verbose output from helm and kubectl - scripts/ci/libraries/_verbosity.sh (see the sketch after this list)
  • pylint fixes in scripts/tools/list-integrations.py
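
For the _verbosity.sh item, the intent is roughly this kind of wrapper (a sketch with assumed names - the real script may differ):

```bash
#!/usr/bin/env bash
# Sketch only - the function name is illustrative, not necessarily the one
# used in scripts/ci/libraries/_verbosity.sh.
function run_quietly() {
    if [[ ${VERBOSE:=false} == "true" ]]; then
        "${@}"                       # verbose mode: stream all output
    else
        local output
        if ! output=$("${@}" 2>&1); then
            echo "${output}" >&2     # quiet mode: show output only on failure
            return 1
        fi
    fi
}

run_quietly helm version
run_quietly kubectl version --client
```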

This may seem like a lot of extra work, but I think it's the only way we can be sure about this change and avoid a lot of teething problems. Right now it is very painful for me to review this change, because I have to remember every change and think about the consequences of many changes at once.

If you like, I will be happy to help you break down this change to make it easier to review.

@potiuk
Member Author

potiuk commented Aug 18, 2020

This change is huge. How can we reduce its size? I can see a few changes that we can extract and do as separate changes

Yep. I am happy to do it. I will try to separate out some of the changes and keep rebasing this one until only the "gist" of the change remains as the last commit.

@potiuk potiuk force-pushed the prebuild-ci-images-in-github-actions branch from d21e67d to 7e8e0cc on August 18, 2020 10:49
potiuk added a commit to PolideaInternal/airflow that referenced this pull request Aug 18, 2020
potiuk added a commit that referenced this pull request Aug 18, 2020
potiuk added a commit to PolideaInternal/airflow that referenced this pull request Aug 18, 2020
This cleans up the document building process and replaces it
with a breeze-only one. The original instructions with
`pip install -e .[doc]` stopped working, so there is no
point keeping them.

Extracted from apache#10368
@potiuk potiuk added this to the Airflow 1.10.13 milestone Nov 14, 2020
@potiuk potiuk added the type:misc/internal Changelog: Misc changes that should appear in change log label Nov 14, 2020
potiuk added a commit that referenced this pull request Nov 16, 2020
We've observed the tests for the last couple of weeks and it seems
most of the tests marked with the "quarantine" marker are succeeding
in a stable way (#10118).
The removed tests have a success ratio of > 95% (20 runs without
problems), and this was also verified a week ago,
so it seems they are rather stable.

There are literally a few that are either failing or causing
the Quarantined builds to hang. I manually reviewed the
master tests that failed over the last few weeks and added the
tests that are causing the build to hang.

It seems that stability has improved - which might be caused
by some temporary problems when we marked the quarantined builds,
or a too "generous" way of marking tests as quarantined, or
maybe the improvement comes from #10368, as the docker engine
and machines used to run the builds in GitHub experience far
less load (image builds are executed in separate builds), so
it might be that resource usage is decreased. Another reason
might be GitHub Actions stability improvements.

Or simply those tests are more stable when run in isolation.

We might still add failing tests back as soon as we see them behave
in a flaky way.

The remaining quarantined tests that need to be fixed:
 * test_local_run (often hangs the build)
 * test_retry_handling_job
 * test_clear_multiple_external_task_marker
 * test_should_force_kill_process
 * test_change_state_for_tis_without_dagrun
 * test_cli_webserver_background

We also move some of those tests to the "heisentests" category.
Those tests run fine in isolation but fail
the builds when run with all the other tests:
 * TestImpersonation tests

We might find that those heisentests can be fixed, but for
now we are going to run them in isolation.

Also - since those quarantined tests are failing more often,
the "num runs" to track for them has been decreased to 10,
to keep track of the 10 last runs only.

(cherry picked from commit b746f33)
potiuk added a commit that referenced this pull request Nov 16, 2020
Part of #10368

(cherry picked from commit a32e90a)
potiuk added a commit that referenced this pull request Nov 16, 2020
potiuk added a commit that referenced this pull request Nov 16, 2020
The output of pre-commit builds, both on CI and locally,
is now limited to only show errors, unless the verbose
variable is set.

We utilise aliases where possible, but pre-commits
run in a non-interactive shell, which
means that aliases do not work as expected, so we have
to run a few functions directly in order to
show the spinner.

Extracted from #10368

(cherry picked from commit 77a635e)
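
A tiny sketch of the alias limitation described above (names are illustrative):

```bash
#!/usr/bin/env bash
# In a non-interactive shell - such as the one pre-commit hooks run in -
# aliases are not expanded by default:
alias spinner_alias='echo "spinner (alias)"'
# spinner_alias                # would fail here with "command not found"

# Functions work in any shell context, so we call them directly instead:
function show_spinner() {
    echo "spinner (function)"
}
show_spinner
```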
potiuk added a commit that referenced this pull request Nov 16, 2020
This allows for all the kinds of verbosity we want, including
writing outputs to output files, and it also works out-of-the-box
in non-interactive shell scripts such as git commit hooks. As a side effect
we now have mocked tools in bats tests, which will allow us to write
more comprehensive unit tests for our bash scripts
(a long overdue task).

Part of #10368

(cherry picked from commit db446f2)
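
A sketch of what such a bats test with a mocked tool could look like (the mocked command and assertions are assumptions, not the actual tests):

```bash
#!/usr/bin/env bats
# Sketch only - the mocked tool and assertions are illustrative.

setup() {
    MOCK_BIN="$(mktemp -d)"
    export MOCK_LOG="${MOCK_BIN}/calls.log"
    # Mock "docker" so the code under test runs without a real daemon;
    # the mock just records how it was called.
    cat >"${MOCK_BIN}/docker" <<'EOF'
#!/usr/bin/env bash
echo "docker ${*}" >> "${MOCK_LOG}"
EOF
    chmod +x "${MOCK_BIN}/docker"
    export PATH="${MOCK_BIN}:${PATH}"
}

@test "the script pulls the CI image" {
    run docker pull some/image:latest
    [ "${status}" -eq 0 ]
    grep -q "pull some/image:latest" "${MOCK_LOG}"
}
```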
potiuk added a commit that referenced this pull request Nov 16, 2020
potiuk added a commit that referenced this pull request Nov 16, 2020
potiuk added a commit that referenced this pull request Nov 16, 2020
potiuk added a commit that referenced this pull request Nov 16, 2020
Follow up after #10368

(cherry picked from commit 2c3ce8e)
potiuk added a commit that referenced this pull request Nov 16, 2020
Breeze failed after #10368

(cherry picked from commit dc27a2a)
potiuk added a commit that referenced this pull request Nov 16, 2020
Follow up after #10368

(cherry picked from commit c35a010)
potiuk added a commit that referenced this pull request Nov 16, 2020
After #10368, we've changed the way we build the images
on CI. We override the ci scripts that we use
to build the image with the scripts taken from master,
so as not to give rogue PR authors the possibility to run
something with the write credentials.

We should not override the in_container scripts, however,
because they become part of the image, so we should use
the ones that came with the PR. That's why we have to move
the "in_container" scripts out of the "ci" folder and
only override the "ci" folder with the one from
master. We've made sure that the scripts in ci
are self-contained and do not need to reach outside of
that folder.

Also, the static checks are done with local files mounted
on CI, because we want to check all the files - not only
those that are embedded in the container.

(cherry picked from commit 1cf1af6)
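
One way to picture that override (a sketch only - the actual workflow steps differ):

```bash
# Sketch only. After checking out the PR code, replace the potentially
# untrusted "ci" scripts with the trusted copy from master, while keeping
# the PR's own in_container scripts (those are baked into the image anyway).
git fetch origin master
rm -rf scripts/ci
git checkout origin/master -- scripts/ci
```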
potiuk added a commit that referenced this pull request Nov 16, 2020
When we pull the image from the base registry we should also push
it to the GitHub Registry so that we can track the
origin of the run.

Note that this merely pushes the tag - the image will be
pulled directly from the registry, so this only updates
the tag in the registry.

Bug introduced in #10368

(cherry picked from commit c979de7)
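
In plain docker commands, the tag update is roughly the following (image names and registry paths are illustrative assumptions):

```bash
# Sketch only - image names and registry paths are assumptions.
docker pull apache/airflow:master-python3.6-ci
docker tag apache/airflow:master-python3.6-ci \
    docker.pkg.github.com/apache/airflow/master-python3.6-ci:latest
docker push docker.pkg.github.com/apache/airflow/master-python3.6-ci:latest
```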
potiuk added a commit that referenced this pull request Nov 16, 2020
kaxil pushed a commit that referenced this pull request Nov 18, 2020
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
Labels
area:dev-tools type:misc/internal Changelog: Misc changes that should appear in change log