GCSToGCSOperator cannot copy a single file/folder without copying other files/folders with that prefix #22675

Closed
1 of 2 tasks
Yao-ATG opened this issue Apr 1, 2022 · 10 comments · Fixed by #24039
Labels: area:providers, good first issue, kind:bug, provider:google

Comments


Yao-ATG commented Apr 1, 2022

Apache Airflow Provider(s)

google

Versions of Apache Airflow Providers

No response

Apache Airflow version

2.2.4 (latest released)

Operating System

MacOS 12.2.1

Deployment

Composer

Deployment details

No response

What happened

I have the files "hourse.jpeg" and "hourse.jpeg.copy" and a folder "hourse.jpeg.folder" in the source bucket.
I use the following code to try to copy only "hourse.jpeg" to another bucket.
from airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator

# Intended to copy only "hourse.jpeg" to the destination bucket.
gcs_to_gcs_op = GCSToGCSOperator(
    task_id="gcs_to_gcs",
    source_bucket=my_source_bucket,
    source_object="hourse.jpeg",
    destination_bucket=my_destination_bucket,
)

As a result, both files and the folder mentioned above are copied.
From the source code, it seems there is no way to do what I want.
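
To make the report concrete, here is a minimal sketch of prefix matching using an in-memory list instead of a real bucket. The `OBJECTS` list, the `inner.txt` file inside the folder, and the `list_by_prefix` helper are hypothetical and only illustrate the observed behavior; this is not the operator's actual code.

```python
# Hypothetical in-memory stand-in for a bucket listing; no GCS calls are made.
OBJECTS = [
    "hourse.jpeg",
    "hourse.jpeg.copy",
    "hourse.jpeg.folder/inner.txt",  # assumed file inside the folder
    "other.png",
]


def list_by_prefix(names, prefix):
    """Return every name that starts with `prefix`, like a prefix-filtered listing."""
    return [name for name in names if name.startswith(prefix)]


# Passing the plain file name still selects everything that shares the prefix.
matched = list_by_prefix(OBJECTS, "hourse.jpeg")
print(matched)
```

Under these assumptions, the plain name selects all three "hourse.jpeg..." entries, which matches what the DAG run copied.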

What you think should happen instead

Only the specified file should be copied; that is, source_object should be treated as an exact match instead of a prefix.
To get the current prefix behavior, the user can (and should) use a wildcard:
source_object="hourse.jpeg*"

How to reproduce

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Yao-ATG added the area:providers and kind:bug labels Apr 1, 2022

boring-cyborg bot commented Apr 1, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

Yao-ATG changed the title from "GCSToGCSOperator cannot a copy a single file/folder without copying other files/folders with that prefix" to "GCSToGCSOperator cannot copy a single file/folder without copying other files/folders with that prefix" Apr 1, 2022
Member

potiuk commented Apr 1, 2022

See the documentation (docstring), which includes examples.

As unintuitive as it is, source_object is a wildcard specification by default. If you want to copy a single object, you need to specify it like this:

source_objects = ['your_object']

See examples here: https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_api/airflow/providers/google/cloud/transfers/gcs_to_gcs/index.html

@potiuk potiuk added the invalid label Apr 1, 2022
@potiuk potiuk closed this as completed Apr 1, 2022
Author

Yao-ATG commented Apr 1, 2022

> See the documentation (docstring), which includes examples.
>
> As unintuitive as it is, source_object is a wildcard specification by default. If you want to copy a single object, you need to specify it like this:
>
> source_objects = ['your_object']
>
> See examples here: https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_api/airflow/providers/google/cloud/transfers/gcs_to_gcs/index.html

Unfortunately, using source_objects instead of source_object doesn't help.
I verified this by running the DAG, both before opening the issue and after your reply.
The source code also shows that there is no difference between specifying a file in source_object and in source_objects.
In https://github.com/apache/airflow/blob/main/airflow/providers/google/cloud/transfers/gcs_to_gcs.py, at line 246, we have
if self.source_object:
    self.source_objects = [self.source_object]
and we work only on source_objects afterwards, treating each object as a prefix.
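
The equivalence can be sketched as follows. This is a simplified model, not the operator's real code: `normalize` mirrors only the two lines quoted above, while `copy_candidates`, the `OBJECTS` list, and the `inner.txt` file are hypothetical stand-ins for the prefix-based listing that follows.

```python
# Hypothetical bucket contents; no GCS calls are made.
OBJECTS = [
    "hourse.jpeg",
    "hourse.jpeg.copy",
    "hourse.jpeg.folder/inner.txt",  # assumed file inside the folder
]


def normalize(source_object=None, source_objects=None):
    """Mirror the quoted lines: a single object is wrapped into a list."""
    if source_object:
        source_objects = [source_object]
    return source_objects or []


def copy_candidates(names, source_object=None, source_objects=None):
    """Treat every normalized entry as a prefix (assumed simplification)."""
    selected = []
    for prefix in normalize(source_object, source_objects):
        selected.extend(n for n in names if n.startswith(prefix))
    return selected


via_single = copy_candidates(OBJECTS, source_object="hourse.jpeg")
via_list = copy_candidates(OBJECTS, source_objects=["hourse.jpeg"])
print(via_single == via_list)
```

Because both spellings collapse into the same list before any matching happens, the model selects the same objects either way.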

@potiuk potiuk reopened this Apr 1, 2022
Member

potiuk commented Apr 1, 2022

I see. I looked at the code, and indeed you are right. I marked it as a good first issue; maybe someone would like to work on it.

Note that the fastest and surest way to get it implemented is to make a PR yourself and lead it to completion. Would you like to contribute such a change? Happy to review the code. If not, it will have to wait for someone to pick it up.

Author

Yao-ATG commented Apr 1, 2022

Before anybody tries to fix it, can we clarify the expected behavior?

Let's limit the discussion to objects without a wildcard.
The current behavior is to treat the object as a prefix, whether it is specified in source_object or source_objects.
For the correct behavior we have two options:
(1) Treat the object as an exact match, again whether it is specified in source_object or source_objects, and let users add a wildcard if they want the current result.
(2) Treat source_object as a prefix and source_objects as an exact match.

Which option should we take?
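
The two options can be compared side by side with a small sketch. Everything here is hypothetical: `match`, `option_1`, `option_2`, and the `OBJECTS` list are illustration-only names, and the wildcard handling is a simplified single-`*` rule (text before `*` is a prefix, text after it is a suffix), assumed rather than taken from the operator.

```python
# Hypothetical bucket contents; no GCS calls are made.
OBJECTS = [
    "hourse.jpeg",
    "hourse.jpeg.copy",
    "hourse.jpeg.folder/inner.txt",  # assumed file inside the folder
]


def match(names, entry, exact):
    """Match one entry: wildcard if it contains '*', else exact or prefix."""
    if "*" in entry:
        # Simplified single-wildcard rule (assumption).
        prefix, _, suffix = entry.partition("*")
        return [
            n for n in names
            if len(n) >= len(prefix) + len(suffix)
            and n.startswith(prefix)
            and n.endswith(suffix)
        ]
    if exact:
        return [n for n in names if n == entry]
    return [n for n in names if n.startswith(entry)]


def option_1(names, source_object=None, source_objects=None):
    """Option (1): exact match for both parameters unless a wildcard is given."""
    entries = [source_object] if source_object else (source_objects or [])
    return [n for e in entries for n in match(names, e, exact=True)]


def option_2(names, source_object=None, source_objects=None):
    """Option (2): source_object is a prefix, source_objects entries are exact."""
    if source_object:
        return match(names, source_object, exact=False)
    return [n for e in (source_objects or []) for n in match(names, e, exact=True)]


print(option_1(OBJECTS, source_object="hourse.jpeg"))
print(option_2(OBJECTS, source_object="hourse.jpeg"))
print(option_2(OBJECTS, source_objects=["hourse.jpeg"]))
```

In this model, option (1) changes what existing DAGs using a plain source_object would copy, while option (2) keeps source_object as it is today and changes only source_objects.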

Member

potiuk commented Apr 1, 2022

I think that can be discussed when the PR is opened. I have no opinion. Maybe you can pick one as a proposal, and the reviewer of the PR might decide which one is OK.

Member

potiuk commented Apr 1, 2022

Ideally, though, it should be backwards compatible.

Member

potiuk commented Apr 1, 2022

I think a flag such as "exact_match_when_no_wildcard" (default False) might be a good solution.
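
A rough sketch of how such a flag could stay backwards compatible is below. The `select` function and `OBJECTS` list are hypothetical, the flag name is only the one suggested in this comment, and the single-`*` wildcard rule is a simplifying assumption; none of this is the operator's real signature.

```python
# Hypothetical bucket contents; no GCS calls are made.
OBJECTS = [
    "hourse.jpeg",
    "hourse.jpeg.copy",
    "hourse.jpeg.folder/inner.txt",  # assumed file inside the folder
]


def select(names, source_object, exact_match_when_no_wildcard=False):
    """Pick objects to copy; the flag only affects names without a wildcard."""
    if "*" in source_object:
        # Wildcard entries behave the same regardless of the flag (assumption).
        prefix, _, suffix = source_object.partition("*")
        return [
            n for n in names
            if len(n) >= len(prefix) + len(suffix)
            and n.startswith(prefix)
            and n.endswith(suffix)
        ]
    if exact_match_when_no_wildcard:
        return [n for n in names if n == source_object]
    # Default (False) keeps the existing prefix behavior.
    return [n for n in names if n.startswith(source_object)]


print(select(OBJECTS, "hourse.jpeg"))
print(select(OBJECTS, "hourse.jpeg", exact_match_when_no_wildcard=True))
```

With the default left at False, existing DAGs would keep copying by prefix, and only callers who opt in would get the exact match.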

@josh-fell josh-fell removed the invalid label Apr 5, 2022
@spatocode

@potiuk I'll be picking this up

Member

potiuk commented May 17, 2022

Assigned you!
