Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GCSToBigQueryOperator and BigQueryToGCSOperator do not respect their project_id arguments #32106

Closed
2 tasks done
bhagany opened this issue Jun 23, 2023 · 9 comments · Fixed by #32232
Closed
2 tasks done
Labels
area:providers good first issue kind:bug This is a clearly a bug provider:google Google (including GCP) related issues

Comments

@bhagany
Copy link
Contributor

bhagany commented Jun 23, 2023

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

We experienced this issue Airflow 2.6.1, but the problem exists in the Google provider rather than core Airflow, and were introduced with these changes. We are using version 10.0.0 of the provider.

The issue that resulted in these changes seems to be based on an incorrect understanding of how projects interact in BigQuery -- namely that the project used for storage and the project used for compute can be separate. The user reporting the issue appears to mistake an error about compute (User does not have bigquery.jobs.create permission in project {project-A}. for an error about storage, and this incorrect diagnosis resulted in a fix that inappropriately defaults the compute project to the project named in destination/source (depending on the operator) table.

The change attempts to allow users to override this (imo incorrect) default, but unfortunately this does not currently work because self.project_id gets overridden with the named table's project here and here.

What you think should happen instead

I think that the easiest fix would be to revert the change, and return to defaulting the compute project to the one specified in the default google cloud connection. However, since I can understand the desire to override the project_id, I think handling it correctly, and clearly distinguishing between the concepts of storage and compute w/r/t projects would also work.

How to reproduce

Attempt to use any other project for running the job, besides the one named in the source/destination table

Operating System

debian

Versions of Apache Airflow Providers

apache-airflow-providers-google==10.0.0

Deployment

Other 3rd-party Helm chart

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@bhagany bhagany added area:core kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Jun 23, 2023
@boring-cyborg
Copy link

boring-cyborg bot commented Jun 23, 2023

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

@bhagany
Copy link
Contributor Author

bhagany commented Jun 23, 2023

I should mention that I am about to go on a 2-week vacation, so while I am willing to discuss and submit a PR, my responses will be delayed. I also think this is a serious enough bug for someone else to take on if they have the time.

@bhagany
Copy link
Contributor Author

bhagany commented Jun 23, 2023

Also, #32093 is related to the inconsistent handling of project_ids caused by the PR I linked in the description. I filed this ticket because I felt the need to call attention to the broader range of issues

@eladkal
Copy link
Contributor

eladkal commented Jun 30, 2023

@irvifa @nathadfield @avinashpandeshwar

I'm a bit lost here. There are 4 open PRs for this issue
#32144
#32143
#32095
#32232

Kindly asking to take under advisment @hussein-awala comment in #32095 (review) and raise 1 PR to address all occurrences (or any way you decide to sort this which makes it easier to review)

cc @potiuk

@nathadfield
Copy link
Collaborator

@eladkal Yeah, it's definitely something that needs addressing more broadly. Will bear all of these in mind.

@avinashpandeshwar
Copy link
Contributor

@eladkal Yeah, your'e right. I will be merging these fixes into my PR.

@bhagany
Copy link
Contributor Author

bhagany commented Jul 3, 2023

I'm still on vacation, but popping in here to add my 2 cents regarding your PR and comment here @avinashpandeshwar -- I don't think these PR's are mutually compatible, with regard to how they treat @Yaro1's changes. I see that your PR currently keeps the default to the storage project if project_id is not provided an an argument, but I would make a case that this is probably more confusing than it's worth. The storage project and the compute project are different things, and conflating them in these operators only furthers misunderstandings that exist about this. This is all only my opinion, of course.

I agree with @hussein-awala's comment on #32095, linked above, and would prefer the approach taken in #32143, which keeps the project_id parameter, and falls back to the hook's project_id. This behavior is the most consistent with other operators, and avoids the storage/compute project confusion

@avinashpandeshwar
Copy link
Contributor

@bhagany I have corrected my PR #32232 so that hook's project_id is the fallback for both storage and compute of the bq_to_gcs and gcs_to_bq jobs. The project_id param is the first preference for compute, if provided. Similarly, the table's project_id is the first preference for storage, if specified.

@sukalpv
Copy link

sukalpv commented Sep 15, 2023

@avinashpandeshwar May I know which version of apache-airflow-providers-google has the fix. I am facing this issue on the latest version 10.8.0.
Running it on GCP composer 2.4.2 and airflow version 2.5.3
Thanks in advance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers good first issue kind:bug This is a clearly a bug provider:google Google (including GCP) related issues
Projects
None yet
6 participants