
Fix BigQueryColumnCheckOperator runtime error #28796

Merged

Conversation

vchiapaikeo
Contributor

A TypeError in the BigQueryColumnCheckOperator causes tasks to fail at runtime. This was initially uncovered while looking into another issue with this operator: #28343 (comment)

This PR fixes the operator by calling the list's extend() method instead of calling the list object itself, which raises TypeError: 'list' object is not callable. A few tests are also added.
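For illustration, a minimal sketch of the failure mode, with hypothetical names rather than the operator's actual source:

checks = []
new_checks = [("col1", "min", 2)]

# Buggy pattern: "calling" the list fails at runtime.
# checks(new_checks)  # TypeError: 'list' object is not callable

# Fixed pattern: extend() appends the elements in place.
checks.extend(new_checks)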

As an aside, I had to set SKIP=run-mypy during my commit because I ran into this unusual pre-commit failure, which doesn't seem relevant:

Run mypy for providers.................................................................Failed
- hook id: run-mypy
- exit code: 1

airflow/providers/google/cloud/operators/bigquery.py:250: error:
"BigQueryCheckOperator" has no attribute "_raise_exception"  [attr-defined]
                self._raise_exception(f"Test failed.\nQuery:\n{self.sql}\n...
                ^
Found 1 error in 1 file (checked 1 source file)
If you see strange stacktraces above, run `breeze ci-image build --python 3.7` and try again.
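For reference, pre-commit skips an individual hook by its id via the SKIP environment variable (the hook id is run-mypy, as reported above); the commit message here is just an example:

SKIP=run-mypy git commit -m "Fix BigQueryColumnCheckOperator runtime error"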

Test DAG

from airflow import DAG

from airflow.providers.google.cloud.operators.bigquery import BigQueryColumnCheckOperator

DEFAULT_TASK_ARGS = {
    "owner": "gcp-data-platform",
    "retries": 1,
    "retry_delay": 10,
    "start_date": "2022-08-01",
}

with DAG(
    max_active_runs=1,
    concurrency=2,
    catchup=False,
    schedule_interval="@daily",
    dag_id="test_bigquery_column_check",
    default_args=DEFAULT_TASK_ARGS,
) as dag:

    basic_column_quality_checks = BigQueryColumnCheckOperator(
        task_id="check_columns",
        table="my-project.vchiapaikeo.test1",
        use_legacy_sql=False,
        column_mapping={
            "col1": {"min": {"greater_than": 0}},
        },
    )
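The column_mapping above asserts that MIN(col1) is greater than 0. A minimal sketch of how such a check is evaluated, with assumed names (the real logic lives in the common.sql SQLColumnCheckOperator that this operator builds on):

def check_passes(check_result, spec):
    # Every comparison key in the spec must hold for the aggregated value.
    ops = {
        "greater_than": lambda value, threshold: value > threshold,
        "geq_to": lambda value, threshold: value >= threshold,
        "less_than": lambda value, threshold: value < threshold,
        "leq_to": lambda value, threshold: value <= threshold,
        "equal_to": lambda value, threshold: value == threshold,
    }
    return all(ops[name](check_result, threshold) for name, threshold in spec.items())

# BigQuery returned MIN(col1) = 2 (see "check_result" in the task logs below),
# and 2 > 0, so the check passes.
print(check_passes(2, {"greater_than": 0}))  # True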


Task Logs:

686f5b14989d
*** Reading local file: /root/airflow/logs/dag_id=test_bigquery_column_check/run_id=scheduled__2023-01-08T00:00:00+00:00/task_id=check_columns/attempt=3.log
[2023-01-09, 01:40:19 UTC] {taskinstance.py:1093} INFO - Dependencies all met for <TaskInstance: test_bigquery_column_check.check_columns scheduled__2023-01-08T00:00:00+00:00 [queued]>
[2023-01-09, 01:40:19 UTC] {taskinstance.py:1093} INFO - Dependencies all met for <TaskInstance: test_bigquery_column_check.check_columns scheduled__2023-01-08T00:00:00+00:00 [queued]>
[2023-01-09, 01:40:19 UTC] {taskinstance.py:1295} INFO - 
--------------------------------------------------------------------------------
[2023-01-09, 01:40:19 UTC] {taskinstance.py:1296} INFO - Starting attempt 3 of 4
[2023-01-09, 01:40:19 UTC] {taskinstance.py:1297} INFO - 
--------------------------------------------------------------------------------
[2023-01-09, 01:40:19 UTC] {taskinstance.py:1316} INFO - Executing <Task(BigQueryColumnCheckOperator): check_columns> on 2023-01-08 00:00:00+00:00
[2023-01-09, 01:40:19 UTC] {standard_task_runner.py:55} INFO - Started process 481 to run task
[2023-01-09, 01:40:20 UTC] {standard_task_runner.py:82} INFO - Running: ['***', 'tasks', 'run', 'test_bigquery_column_check', 'check_columns', 'scheduled__2023-01-08T00:00:00+00:00', '--job-id', '5', '--raw', '--subdir', 'DAGS_FOLDER/test_bigquery_column_check.py', '--cfg-path', '/tmp/tmpgeqtp2hz']
[2023-01-09, 01:40:20 UTC] {standard_task_runner.py:83} INFO - Job 5: Subtask check_columns
[2023-01-09, 01:40:21 UTC] {task_command.py:391} INFO - Running <TaskInstance: test_bigquery_column_check.check_columns scheduled__2023-01-08T00:00:00+00:00 [running]> on host 686f5b14989d
[2023-01-09, 01:40:21 UTC] {taskinstance.py:1525} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=gcp-data-platform
AIRFLOW_CTX_DAG_ID=test_bigquery_column_check
AIRFLOW_CTX_TASK_ID=check_columns
AIRFLOW_CTX_EXECUTION_DATE=2023-01-08T00:00:00+00:00
AIRFLOW_CTX_TRY_NUMBER=3
AIRFLOW_CTX_DAG_RUN_ID=scheduled__2023-01-08T00:00:00+00:00
[2023-01-09, 01:40:21 UTC] {base.py:73} INFO - Using connection ID 'google_cloud_default' for task execution.
[2023-01-09, 01:40:21 UTC] {credentials_provider.py:323} INFO - Getting connection using `google.auth.default()` since no key file is defined for hook.
[2023-01-09, 01:40:21 UTC] {_default.py:649} WARNING - No project ID could be determined. Consider running `gcloud config set project` or setting the GOOGLE_CLOUD_PROJECT environment variable
[2023-01-09, 01:40:21 UTC] {bigquery.py:1539} INFO - Inserting job ***_1673228421636668_2d2a9b688dcd63bef1c449cd8b764f86
[2023-01-09, 01:40:23 UTC] {bigquery.py:601} INFO - Record:   col_name check_type  check_result
0     col1        min             2
[2023-01-09, 01:40:23 UTC] {bigquery.py:628} INFO - All tests have passed
[2023-01-09, 01:40:23 UTC] {taskinstance.py:1339} INFO - Marking task as SUCCESS. dag_id=test_bigquery_column_check, task_id=check_columns, execution_date=20230108T000000, start_date=20230109T014019, end_date=20230109T014023
[2023-01-09, 01:40:23 UTC] {local_task_job.py:211} INFO - Task exited with return code 0
[2023-01-09, 01:40:23 UTC] {taskinstance.py:2613} INFO - 0 downstream tasks scheduled from follow-on schedule check

cc: @eladkal , @VladaZakharova , @denimalpaca



Labels: area:providers, provider:google