Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bigquery Operators in defferable mode fail if the location is not US. #29307

Closed
1 of 2 tasks
masterlittle opened this issue Feb 2, 2023 · 8 comments
Closed
1 of 2 tasks
Labels
area:providers good first issue kind:bug This is a clearly a bug provider:google Google (including GCP) related issues

Comments

@masterlittle
Copy link
Contributor

Apache Airflow Provider(s)

google

Versions of Apache Airflow Providers

apache-airflow-providers-google==8.8.0

Apache Airflow version

2.5.1

Operating System

Mac

Deployment

Astronomer

Deployment details

No response

What happened

While using BigqueryInsertJobOperator in defferable mode, it fails.
Logs -

airflow.exceptions.AirflowException: 404, message='Not Found: {\n "error": {\n "code": 404,\n "message": "Not found: Job dev-data-platform-294611:airflow_derived_tables_all_tasks_group_temp_user_seg_all_create_stage_table_2023_02_01T04_30_00_00_00_0f2853ad8762909d41067023ddb3c6d8",\n "errors": [\n {\n "message": "Not found: Job dev-data-platform-294611:airflow_derived_tables_all_tasks_group_temp_user_seg_all_create_stage_table_2023_02_01T04_30_00_00_00_0f2853ad8762909d41067023ddb3c6d8",\n "domain": "global",\n "reason": "notFound"\n }\n ],\n "status": "NOT_FOUND"\n }\n}\n', url=URL('https://www.googleapis.com/bigquery/v2/projects/dev-data-platform-294611/jobs/airflow_derived_tables_all_tasks_group_temp_user_seg_all_create_stage_table_2023_02_01T04_30_00_00_00_0f2853ad8762909d41067023ddb3c6d8')

What you think should happen instead

The bigquery insert job should succeed.

Debugged and the error happens because the google-aio lib does not pass in the location param when making the GET JOB api call to bigquery.

Acc. to docs if using any region besides us and europe, this location should be passed.

As seen in the logs, it uses the domain global instead of the location that is passed to it.

How to reproduce

In the BigqueryInsertJobOperator, give a location which is not in US (eg - asia-south1) and make deferrable=true

BigQueryInsertJobOperator(
                            task_id="create_stage_table",
                            project_id=configuration.task_info.bigquery_info.billing_project_id,
                            location='asia-south1',
                            deferrable=True,
                            configuration={
                                "query": {
                                    "query": f"{{{{ ti.xcom_pull(task_ids='all_tasks.group_"
                                    f"{configuration.task_info.table_name}.prepare_query') }}}}",
                                    "useLegacySql": False,
                                    "priority": "BATCH",
                                    "writeDisposition": "WRITE_TRUNCATE",
                                    "destinationTable": {
                                        "projectId": configuration.task_info.project_name,
                                        "datasetId": configuration.task_info.stage_dataset,
                                        "tableId": configuration.task_info.stage_table,
                                    },
                                    "clustering": {"fields": get_clustering_key(configuration)},
                                    "time_partitioning": {
                                        'field': configuration.task_info.table_info.partition_key,
                                        'type': 'DAY',
                                    },
                                },
                                "dryRun": "{{ var.value.get('derived_tables_dry_run', 0) }}",
                                "labels": get_query_labels(configuration),
                            },
                        )

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@masterlittle masterlittle added area:providers kind:bug This is a clearly a bug labels Feb 2, 2023
@boring-cyborg
Copy link

boring-cyborg bot commented Feb 2, 2023

Thanks for opening your first issue here! Be sure to follow the issue template!

@eladkal
Copy link
Contributor

eladkal commented Feb 2, 2023

@lwyszomi can you take a look?

@eladkal eladkal added provider:google Google (including GCP) related issues good first issue labels Feb 2, 2023
@VladaZakharova
Copy link
Contributor

Hi Team!
Thank you for detailed description, i will take a look on it and let you know about any updates

@VladaZakharova
Copy link
Contributor

VladaZakharova commented Feb 7, 2023

Hi Team!
After some investigation, it looks like that it is not a problem of hook implementation, but the way it uses Job object from gcloud.aio.bigquery package.
For correct retrieving of the Job with specific job_id it is required to pass location, if you are using location other then US and you will get 404 error:

curl -H "Authorization: Bearer $ACCESS_TOKEN"   "https://www.googleapis.com/bigquery/v2/projects/airflow-system-tests-303516/jobs/AAA_f9a0fb8126085cf06d9155cc320cbbfa"
{
  "error": {
    "code": 404,
    "message": "Not found: Job airflow-system-tests-303516.AAA_f9a0fb8126085cf06d9155cc320cbbfa",
    "errors": [
      {
        "message": "Not found: Job airflow-system-tests-303516.AAA_f9a0fb8126085cf06d9155cc320cbbfa",
        "domain": "global",
        "reason": "notFound"
      }
    ],
    "status": "NOT_FOUND"
  }
}

But specifying the correct location of the job in the url solves this problem:

curl -H "Authorization: Bearer $ACCESS_TOKEN"   "https://www.googleapis.com/bigquery/v2/projects/airflow-system-tests-303516/jobs/AAA_f9a0fb8126085cf06d9155cc320cbbfa?location=asia-south1"
{
  "kind": "bigquery#job",
  "etag": "7Ut/s/Jhqvgpbb8miRNbYg==",
  "id": "airflow-system-tests-303516:asia-south1.AAA_f9a0fb8126085cf06d9155cc320cbbfa",
  "selfLink": "https://www.googleapis.com/bigquery/v2/projects/airflow-system-tests-303516/jobs/AAA_f9a0fb8126085cf06d9155cc320cbbfa?location=asia-south1",
  "user_email": "vlada-system-test@airflow-system-tests-303516.iam.gserviceaccount.com",
  "configuration": {
    "query": {...

Current implementation of the methods for Job object from gcloud.aio.bigquery package doesn't support adding location as a parameter for the methods:
https://github.com/talkiq/gcloud-aio/blob/c64b370bdae5aa0bd72373d1e4ef1c4a4b55a3c7/bigquery/gcloud/aio/bigquery/job.py#L60
But on the other hand, sync implementation of this method for QueryJob object takes location as a parameter when constructing url for the call:
https://github.com/googleapis/python-bigquery/blob/beab7c2b27c27d8e824cbc66b290be8158da7abf/google/cloud/bigquery/client.py#L188

I have created an issue in GitHub repo for the gcloud.aio.bigquery package with the example and reproduction steps of the problem, however it may take some time to covert this issue from their side:
talkiq/gcloud-aio#584

As a workaround, I can implement async methods for BigQuery calls and don't use the Job object from gcloud.aio.bigquery package, but it will require to change all other classes that use current implementation of BigQueryAsyncHook with calling Job object methods: BigQueryIntervalCheckTrigger, BigQueryValueCheckTrigger, BigQueryCheckTrigger, BigQueryGetDataTrigger and all operators that use those triggers.

Please, let me know what option you prefer more, thanks :)

@eladkal
Copy link
Contributor

eladkal commented Feb 7, 2023

I don't think Airflow should workaround bugs/issues of upstream packages (at least not these kind of bugs)

I would suggest to link here the bug report of upstream repo, users who will hit this issue can find the workaround as you suggested and comment in the upstream package repo to increase awareness for the bug.

@VladaZakharova
Copy link
Contributor

Okay, sure. I have updated previous comment with the link to created issue.

@eladkal
Copy link
Contributor

eladkal commented Feb 24, 2023

@VladaZakharova so if i'm reading this right we don't have a task on this issue and this one should be fixed upstream? if I think we can close this one?

@VladaZakharova
Copy link
Contributor

@eladkal Yes, you are right, you can close it. I have placed the solution o the comment and the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers good first issue kind:bug This is a clearly a bug provider:google Google (including GCP) related issues
Projects
None yet
Development

No branches or pull requests

3 participants