
[AIRFLOW- boyscout] Enforce delimiter for gcs_to_gcs operator using a flag, enforce_delimiter #4619

Closed
icaroNZ wants to merge 1 commit into apache:master from icaroNZ:master

Conversation

@icaroNZ

@icaroNZ icaroNZ commented Jan 30, 2019

Problem now:
Given the files test1.csv, test2.csv, test10.csv, test100.csv, test1.gz, test2.gz, test10.gz, test100.gz,
when trying to match test*.csv,
the result is that all of the files above are matched.

Fix:
Given the same files,
when trying to match test*.csv,
only test1.csv, test2.csv, test10.csv, test100.csv are matched.

A problem that remains in the code: when using multiple wildcards, the 'middle part' of the pattern is not enforced:
Given the files testProd1.csv, test2Prod.csv, testProd10.csv, testProd100.csv, testProd1.gz, test2Prod.gz, test10Prod.gz, test100Prod.gz in directories dir1 and dir2,
when trying to match /testAcceptance.csv,
all of the files above are matched.
Expected: no files should be returned.

The enforce_delimiter flag defaults to False and does not change the current operator behavior when the flag is set to False or left unset.
When set to True, the operator uses a new hook method, list_with_delimiter, which enforces the value after the last wildcard '*'.
Note that this PR fixes only the enforcement of the last part of the path; the middle part stays as it is, as described above.
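The intended matching behavior can be illustrated with a small sketch. This is not the actual hook code from the PR; the helper name and the use of fnmatch are illustrative assumptions:

```python
from fnmatch import fnmatch

# Files from the example above.
files = ["test1.csv", "test2.csv", "test10.csv", "test100.csv",
         "test1.gz", "test2.gz", "test10.gz", "test100.gz"]

def match_enforcing_suffix(names, pattern):
    """Keep only the names that match the whole wildcard pattern,
    so the part after the last '*' (here '.csv') is enforced."""
    return [name for name in names if fnmatch(name, pattern)]

print(match_enforcing_suffix(files, "test*.csv"))
# ['test1.csv', 'test2.csv', 'test10.csv', 'test100.csv']
```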

@kaxil
Member

kaxil commented Jan 30, 2019

-1 for this one, as we should have the same options as in the GCS API. Using the correct combination of prefix and delimiter should solve the issue you refer to.

@icaroNZ
Author

icaroNZ commented Jan 30, 2019

Delimiter is passed to gcs_hook but its value is never used, as per the explanation above. It took us a long time to notice that compacted files were being copied along with the CSV files: when we used *.csv, the delimiter .csv was never applied, so anything after the * was copied, including the .gz and .csv.gz files.
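The reported behavior can be reproduced with a minimal simulation of the listing logic as described in this thread. This is a sketch, not the real hook; both function names are hypothetical:

```python
def split_wildcard(source_object):
    """Split the source object at the first '*' into prefix and delimiter,
    mirroring how the operator derives its listing arguments."""
    prefix, _, delimiter = source_object.partition("*")
    return prefix, delimiter

def list_ignoring_delimiter(names, prefix, delimiter):
    # The reported bug: the delimiter is passed along but never applied,
    # so every object starting with the prefix is returned.
    return [name for name in names if name.startswith(prefix)]

files = ["test1.csv", "test1.gz", "test1.csv.gz"]
prefix, delimiter = split_wildcard("test*.csv")
print(list_ignoring_delimiter(files, prefix, delimiter))
# ['test1.csv', 'test1.gz', 'test1.csv.gz'] -- the .gz files leak through
```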

@kaxil
Member

kaxil commented Jan 30, 2019

You can't (or probably shouldn't) use multiple wildcards. We already have this in our docstring, and Google has also listed it in their API docs.

You can use only one wildcard for objects

https://github.com/apache/airflow/blob/b37366c636be7216d7abc12655106b83b9e1397b/airflow/contrib/operators/gcs_to_gcs.py#L32:L36

You can use only one wildcard for objects (filenames) within your

@icaroNZ
Author

icaroNZ commented Jan 30, 2019

I am not talking about multiple wildcards, just one. The docs say "The wildcard can appear inside the object name or at the end of the object name", which is not what happens: if you try to use test*.csv, you will also select files like test.gz.

@kaxil
Member

kaxil commented Jan 30, 2019

Ok. This needs to be tested, as it was working perfectly fine. I will try to spend some time over the weekend looking into this and will let you know. Thanks for the contribution @icaroNZ

@kurtqq
Contributor

kurtqq commented Apr 13, 2019

If the behavior is as shown in the example, this is indeed a bug.
I use the GCS operators and have never encountered this, although in my case there is no mix of different file types in the same directory.

@kaxil did you have time to check this?

@kaxil
Member

kaxil commented Apr 13, 2019

@icaroNZ @kurtqq I am not able to reproduce this with the code in master. Following is the test I conducted:

Files inside my test bucket:

a.csv
a.py
a2.py

DAG:

GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='copy_files',
    source_bucket=TEST_BUCKET,
    source_object='a*.csv',
    destination_bucket=TEST_BUCKET2,
    destination_object='copied_objects/',
)

Logs:

[2019-04-13 21:57:10,622] {gcs_hook.py:139} INFO - Total Bytes: 1034328 | Bytes Written: 1034328
[2019-04-13 21:57:10,624] {gcs_hook.py:151} INFO - Object a.csv in bucket *** copied to object a.csv in bucket ***

When I change source_object to a*.py, I get the following:
Logs:

[2019-04-13 21:58:39,162] {gcs_hook.py:139} INFO - Total Bytes: 1034328 | Bytes Written: 1034328
[2019-04-13 21:58:39,167] {gcs_hook.py:151} INFO - Object a.py in bucket *** copied to object a.py in bucket ***
[2019-04-13 21:58:39,458] {gcs_hook.py:139} INFO - Total Bytes: 1034328 | Bytes Written: 1034328
[2019-04-13 21:58:39,460] {gcs_hook.py:151} INFO -  Object a2.py in bucket *** copied to object a2.py in bucket ***

Can you guys confirm if you can reproduce it with the code in master?

@kurtqq
Contributor

kurtqq commented Apr 14, 2019

Can't reproduce either.

@mik-laj mik-laj added the provider:google Google (including GCP) related issues label Jul 22, 2019
@mik-laj
Member

mik-laj commented Aug 13, 2019

It seems to me that this change is no longer applicable, so I am closing this PR. If I am wrong, please reopen it.

@mik-laj mik-laj closed this Aug 13, 2019
