
Support for replacing table in BigQueryCreateEmptyTableOperator #12051

Closed

shaneikennedy wants to merge 195 commits into apache:v1-10-test from shaneikennedy:v1-10-test

Conversation

@shaneikennedy

This PR addresses #11911 and adds support for replacing an existing table when using the BigQueryCreateEmptyTableOperator.

It is currently a WIP. Things that still need to be addressed:

  1. Test coverage: there's a preliminary test for what I think should be tested, but it needs some style guidance
  2. Documentation updates

There are two commits for now but once the points above are resolved I will squash everything to a single commit 👍

@manesioz I know you wanted to work on this one too so please feel free to help out and we can co-author this one!

@turbaszek I tried to follow the suggestion you left on the related issue, let me know what you think!
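For context, here is a hypothetical usage sketch of the behaviour this PR proposes. The `replace_existing_table` flag is purely illustrative (the PR was closed unmerged, so no parameter shipped under that name), while `dataset_id`, `table_id`, and `schema_fields` are existing operator arguments:

```python
# Hypothetical sketch only: `replace_existing_table` is an illustrative name for
# the behaviour proposed in this PR and is not an actual operator argument.
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryCreateEmptyTableOperator
from airflow.utils.dates import days_ago

with DAG(
    dag_id="bq_create_empty_table_example",
    start_date=days_ago(1),
    schedule_interval=None,
) as dag:
    create_table = BigQueryCreateEmptyTableOperator(
        task_id="create_reporting_table",
        dataset_id="my_dataset",
        table_id="my_table",
        schema_fields=[
            {"name": "id", "type": "INTEGER", "mode": "REQUIRED"},
            {"name": "name", "type": "STRING", "mode": "NULLABLE"},
        ],
        # Hypothetical: drop and recreate the table if it already exists, so the
        # task can be re-run without failing on an "already exists" error.
        replace_existing_table=True,
    )
```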

varundhussa and others added 30 commits September 3, 2020 14:20
apache#10732)

* Create a script to migrate KubernetesExecutor airflow.cfg configs to pod_template_file

* fix help for command

* add test

* address comments

* pass for 2.7
Co-authored-by: Tomek Urbaszek <tomasz.urbaszek@polidea.com>
Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
* Fix breaking changes in Pod conversion for 1.10.13

* fix tests

* fix flake8

* fix test

* fix image secrets

* Update airflow/kubernetes/pod_launcher.py

Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>

Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
* More fancy environment checking

* fixup! More fancy environment checking

(cherry picked from commit 88e5c35)
* Add redbubble link to Airflow merch

* Update README.md

Co-authored-by: Kamil Breguła <mik-laj@users.noreply.github.com>

Co-authored-by: Kamil Breguła <mik-laj@users.noreply.github.com>
(cherry picked from commit 558be73)
The output of pre-commit builds, both on CI and locally,
is now limited to only show errors, unless the verbose
variable is set.

We utilise aliases where possible, but pre-commits
run in a non-interactive shell, which
means that aliases do not work as expected, so we have
to run a few functions directly in order to
show the spinner.

Extracted from apache#10368

(cherry picked from commit 77a635e)
The EMBEDDED dags were only really useful for testing,
but they required customising the built production image
(running with an extra --build-arg flag). This is not needed,
as it is better to extend the image with FROM
and add the dags afterwards. This way you do not have
to rebuild the image while iterating on it.

(cherry picked from commit e179853)
This allows for all the kinds of verbosity we want, including
writing outputs to output files, and it also works out-of-the-box
in git-commit non-interactive shell scripts. As a side effect,
we also have mocked tools in bats tests, which will allow us to write
more comprehensive unit tests for our bash scripts
(this is a long overdue task).

Part of apache#10368

(cherry picked from commit db446f2)
Previously it was failing with `unbound variable AIRFLOW_PROD_BASE_TAG` and also failing because it could not find the "kind" binary

(cherry picked from commit 5739ba2)
* CI Images are now pre-build and stored in registry

With this change we utilise the latest pull_request_target
event type from GitHub Actions and we build the
CI image only once (per version) for the entire run.

This saves from 2 to 10 minutes per job (!) depending on
how much of the Docker image needs to be rebuilt.

It works in such a way that the image is built only in the
build-or-wait step. In case of direct push runs or
scheduled runs, the build-or-wait step builds and pushes
the CI image to the GitHub registry. In case of
pull_request runs, the build-or-wait step waits until the
separate build-ci-image.yml workflow builds and pushes
the image, and it only moves forward once the image
is ready.

This has numerous advantages:

1) Each job that requires the CI image is much faster, because
   instead of pulling + rebuilding the image it only pulls
   the image that was built once. This saves around 2 minutes
   per job in regular builds, but in case of python patch-level
   updates, or adding new requirements, it can save up to 10
   minutes per job (!)

2) While the images are being rebuilt we only block one job waiting
   for all the images. The tests start running in parallel
   only when all images are ready, so we are not blocking
   other runs from running.

3) The whole run uses THE SAME image. Previously we could have some
   variations, because the images were built at different times
   and releases of dependencies in-between several
   jobs could make different jobs in the same run use a slightly
   different image. This is not happening any more.

4) Also when we push the image to GitHub or DockerHub we push the
   very same image that was built and tested. Previously it could
   happen that the image pushed was slightly different than the
   one that was used for testing (for the same reason).

5) A similar case is with the production images. We are now building
   and pushing consistently the same images across the board.

6) Documentation building is split into two parallel jobs, docs
   building and spell checking, which decreases the elapsed time for
   the docs build.

7) Last but not least - we keep the history of all the images
   - those images contain the SHA of the commit. This means
   that we can simply download and run the image locally to reproduce
   any problem that anyone had in their PR (!). This is super useful
   to be able to help others to test their problems.

* fixup! CI Images are now pre-build and stored in registry

* fixup! fixup! CI Images are now pre-build and stored in registry

* fixup! fixup! fixup! CI Images are now pre-build and stored in registry

* fixup! fixup! fixup! CI Images are now pre-build and stored in registry

(cherry picked from commit de7500d)
Recent releases of FAB and Celery caused our installation to
fail. Luckily we have protection so that regular PRs are not
affected, however we need to update the setup.py to exclude
those dependencies that cause the problem.

Those are:

* vine - which is used by the Celery Sensor (via kombu) - the 5.0.0
  version breaks the celery-vine feature

* Flask-OAuthLib and flask-login - the combination of the current
  requirements caused a conflict by forcing flask-login to
  be 0.5.0, which is not compatible with Flask Application Builder

(cherry picked from commit f76ab1f)
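For illustration, a minimal sketch of the kind of setup.py exclusions the commit above describes; the exact version bounds here are assumptions, not necessarily the specifiers used in the real change.

```python
# Illustrative pins only; the real change may use different packages and bounds.
INSTALL_REQUIREMENTS = [
    # vine 5.0.0 broke the celery/kombu integration, so stay below it
    "vine>=1.3.0,<5.0.0",
    # flask-login 0.5.0 conflicts with Flask Application Builder (pulled in
    # via Flask-OAuthLib), so cap it at a version known to work
    "flask-login>=0.3,<0.5",
]
```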
Snakebite's kerberos support relied on python-krbV,
which has been removed from PyPI. It did not work
completely anyway, due to snakebite not being officially
supported in python3 (snakebite-py3 did not work with
SSL, which made Kerberos pretty much unusable).

This commit removes snakebite's kerberos support
from setup.py, so that you can still install kerberos
as an extra for other uses.

(cherry picked from commit 35840ff)
…he#10377)

This cleans up the documentation building process and replaces it
with a Breeze-only one. The original instructions with
`pip install -e .[doc]` stopped working, so there is no
point keeping them.

Extracted from apache#10368

(cherry picked from commit 9228bf2)
Breeze failed after apache#10368

(cherry picked from commit dc27a2a)
potiuk and others added 13 commits October 12, 2020 01:30
A wrong `if` check in the GitHub Action meant that upgrading to the latest
constraints did not work for a while.

(cherry picked from commit a34f5ee)
A problem was introduced in apache#11397 where a few too many "Build Image"
jobs are being cancelled by a subsequent Build Image run. For now it
cancels all the Build Image jobs that are running :(.

(cherry picked from commit 076fe88)
We have started to experience "unknown_blob" errors intermittently
recently with the GitHub Docker registry. We might eventually need
to migrate to GCR (which is eventually going to replace the
Docker Registry for GitHub).

A ticket has been opened with Apache Infrastructure to enable
access to GCR and to make some statements about access
rights management for GCR: https://issues.apache.org/jira/projects/INFRA/issues/INFRA-20959
Also a ticket to GitHub Support has been raised about it
(https://support.github.com/ticket/personal/0/861667), as we
cannot delete our public images in the Docker registry.

But until this happens, the workaround might help us
handle the situations where we get intermittent errors
while pushing to the registry. This seems to be a common
error when an NGINX proxy is used to proxy the GitHub Registry, so
it is likely that retrying will work around the issue.

(cherry picked from commit f9dddd5)
* Add capability of customising PyPI sources

This change adds the capability of customising the installation of PyPI
modules via a custom .pypirc file. This might allow installing
dependencies from an in-house, vetted PyPI registry.

(cherry picked from commit 45d33db)
The SHA of cancel-workflow-action in apache#11397 was pointing to the previous
(3.1) version of the action. This PR fixes it to point to the
right (3.2) version.

(cherry picked from commit 4de8f85)
* Modify helm chart to use pod_template_file

Since we are deprecating most KubernetesExecutor arguments,
we should use the pod_template_file when launching Airflow
with the KubernetesExecutor

* fix tests

* one more nit

* fix dag command

* fix pylint

(cherry picked from commit 56bd9b7)
…che#4751)

This decreases the scheduler delay between tasks by about 20% for larger DAGs,
and sometimes more for even larger or more complex DAGs.

The delay between tasks can be a major issue, especially when we have DAGs with
many subdags. It turns out that the scheduling process spends plenty of time in
dependency checking; the trigger rule dependency used to call the DB for
each task instance, and we made it call the DB just once per dag_run

(cherry picked from commit 50efda5)
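A rough sketch of the optimisation described above (not the actual Airflow implementation): fetch the task-instance states once per dag_run and evaluate each task's trigger rule against that cached mapping, instead of querying the database once per task instance.

```python
from airflow.models import TaskInstance
from airflow.utils.state import State


def upstream_states_for_dag_run(session, dag_id, execution_date):
    """One DB query per dag_run: map task_id -> state for all task instances."""
    rows = (
        session.query(TaskInstance.task_id, TaskInstance.state)
        .filter(
            TaskInstance.dag_id == dag_id,
            TaskInstance.execution_date == execution_date,
        )
        .all()
    )
    return dict(rows)


def all_upstream_success(task, states_by_task_id):
    """Evaluate an ALL_SUCCESS-style trigger rule against the cached states,
    with no additional DB round trips per task instance."""
    return all(
        states_by_task_id.get(upstream_id) == State.SUCCESS
        for upstream_id in task.upstream_task_ids
    )
```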
If you used the build context from the git repo, the .pypirc file was missing, and
COPY in a Dockerfile is not conditional.

This change copies the .pypirc conditionally from the
docker-context-files folder instead.

Also, it was needlessly copied into the main image, where it is not
needed, and it was even dangerous to do so.

(cherry picked from commit 53e5d8f)
apache#11911
This makes for an easier idempotent create-empty-table workflow
@shaneikennedy shaneikennedy marked this pull request as draft November 2, 2020 19:53
@turbaszek
Member

@shaneikennedy please work on the master branch. We do not maintain operators in the contrib module. See:
https://airflow.readthedocs.io/en/latest/backport-providers.html
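For reference, on master (and via the backport providers for Airflow 1.10) this operator lives in the providers package rather than contrib. A minimal sketch, assuming the Google backport provider package (apache-airflow-backport-providers-google) is installed:

```python
# Assumes the Google backport provider is installed for Airflow 1.10, e.g.:
#   pip install apache-airflow-backport-providers-google
# The operator is then imported from the providers package instead of contrib.
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryCreateEmptyTableOperator,
)

create_table = BigQueryCreateEmptyTableOperator(
    task_id="create_empty_table",
    dataset_id="my_dataset",
    table_id="my_table",
)
```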

@kaxil kaxil force-pushed the v1-10-test branch 2 times, most recently from c4c1cab to 91a1305 Compare November 12, 2020 21:07
@potiuk potiuk force-pushed the v1-10-test branch 9 times, most recently from 8bdd442 to 0122893 Compare November 16, 2020 15:18
@kaxil kaxil closed this Nov 17, 2020
@kaxil kaxil deleted the branch apache:v1-10-test November 17, 2020 12:51
@damjad
Contributor

damjad commented Jul 11, 2023

Was this ever released in any other PR?

@potiuk
Member

potiuk commented Jul 11, 2023

I think even the eldest do not know after 3 years. But you have all the sources to look for it.

