
[AIRFLOW- boyscout] Enforce delimiter for gcs_to_gcs operator using a flag, enforce_delimiter #4619

Closed
icaroNZ wants to merge 1 commit into apache:master from icaroNZ:master

Conversation

@icaroNZ

@icaroNZ icaroNZ commented Jan 30, 2019

Problem now:
Given the files test1.csv, test2.csv, test10.csv, test100.csv, test1.gz, test2.gz, test10.gz, test100.gz,
when trying to match test*.csv,
the result is that all of the files above are matched.

Fix:
Given the same files,
when trying to match test*.csv,
only test1.csv, test2.csv, test10.csv, test100.csv are matched.

A problem that remains in the code: when using multiple wildcards, the 'middle part' of the pattern is not enforced:
Given the files testProd1.csv, test2Prod.csv, testProd10.csv, testProd100.csv, testProd1.gz, test2Prod.gz, test10Prod.gz, test100Prod.gz in directories dir1 and dir2,
when trying to match /testAcceptance.csv,
all of the files above are matched.
Expected: no files should be returned.

The enforce_delimiter flag defaults to False and does not change the current operator behavior when the flag is set to False or left unset.
When set to True, the operator uses a new hook method, list_with_delimiter, which enforces the value after the last wildcard '*'.
Note that this PR fixes only the enforcement of the last part of the path; the middle part stays as it is, as described above.
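The intended matching behavior can be illustrated with a small sketch. This is not the actual hook code from the PR; the helper name and the use of fnmatch are illustrative assumptions:

```python
from fnmatch import fnmatch

# Files from the example above.
files = ["test1.csv", "test2.csv", "test10.csv", "test100.csv",
         "test1.gz", "test2.gz", "test10.gz", "test100.gz"]

def match_enforcing_suffix(names, pattern):
    """Keep only the names that match the whole wildcard pattern,
    so the part after the last '*' (here '.csv') is enforced."""
    return [name for name in names if fnmatch(name, pattern)]

print(match_enforcing_suffix(files, "test*.csv"))
# ['test1.csv', 'test2.csv', 'test10.csv', 'test100.csv']
```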

@kaxil
Member

kaxil commented Jan 30, 2019

-1 for this one, as we should have the same options as in the GCS API. Using the correct combination of prefix and delimiter should solve the issue you refer to.

@icaroNZ
Author

icaroNZ commented Jan 30, 2019

Delimiter is passed to gcs_hook but its value is never used, as per the explanation above. It took us a long time to notice that compacted files were being copied along with the CSV files: when we used *.csv, the delimiter .csv was never applied, so anything after the * was copied, including the .gz and .csv.gz files.
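The reported behavior can be reproduced with a minimal simulation of the listing logic as described in this thread. This is a sketch, not the real hook; both function names are hypothetical:

```python
def split_wildcard(source_object):
    """Split the source object at the first '*' into prefix and delimiter,
    mirroring how the operator derives its listing arguments."""
    prefix, _, delimiter = source_object.partition("*")
    return prefix, delimiter

def list_ignoring_delimiter(names, prefix, delimiter):
    # The reported bug: the delimiter is passed along but never applied,
    # so every object starting with the prefix is returned.
    return [name for name in names if name.startswith(prefix)]

files = ["test1.csv", "test1.gz", "test1.csv.gz"]
prefix, delimiter = split_wildcard("test*.csv")
print(list_ignoring_delimiter(files, prefix, delimiter))
# ['test1.csv', 'test1.gz', 'test1.csv.gz'] -- the .gz files leak through
```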

@kaxil
Member

kaxil commented Jan 30, 2019

You can't (or probably shouldn't) use multiple wildcards. We already have this in our docstring, and Google has also listed it in their API docs.

You can use only one wildcard for objects

https://github.com/apache/airflow/blob/b37366c636be7216d7abc12655106b83b9e1397b/airflow/contrib/operators/gcs_to_gcs.py#L32:L36

You can use only one wildcard for objects (filenames) within your

@icaroNZ
Author

icaroNZ commented Jan 30, 2019

I am not talking about multiple wildcards, just one. The docs say "The wildcard can appear inside the object name or at the end of the object name", which is not what happens: if you try to use test*.csv, you will also select files like test.gz.

@kaxil
Member

kaxil commented Jan 30, 2019

Ok. This needs to be tested, as it was working perfectly fine. I will try to spend some time over the weekend looking into this and will let you know. Thanks for the contribution @icaroNZ

@kurtqq
Contributor

kurtqq commented Apr 13, 2019

If the behavior is as shown in the example, this is indeed a bug.
I use the GCS operators and have never encountered this, although in my case there is no mix of different file types in the same directory.

@kaxil did you have time to check this?

@kaxil
Member

kaxil commented Apr 13, 2019

@icaroNZ @kurtqq I am not able to reproduce this with the code in master. Following is the test I conducted:

Files inside my test bucket:

a.csv
a.py
a2.py

DAG:

GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='copy_files',
    source_bucket=TEST_BUCKET,
    source_object='a*.csv',
    destination_bucket=TEST_BUCKET2,
    destination_object='copied_objects/',
)

Logs:

[2019-04-13 21:57:10,622] {gcs_hook.py:139} INFO - Total Bytes: 1034328 | Bytes Written: 1034328
[2019-04-13 21:57:10,624] {gcs_hook.py:151} INFO - Object a.csv in bucket *** copied to object a.csv in bucket ***

When I change source_object to a*.py, I get the following:
Logs:

[2019-04-13 21:58:39,162] {gcs_hook.py:139} INFO - Total Bytes: 1034328 | Bytes Written: 1034328
[2019-04-13 21:58:39,167] {gcs_hook.py:151} INFO - Object a.py in bucket *** copied to object a.py in bucket ***
[2019-04-13 21:58:39,458] {gcs_hook.py:139} INFO - Total Bytes: 1034328 | Bytes Written: 1034328
[2019-04-13 21:58:39,460] {gcs_hook.py:151} INFO -  Object a2.py in bucket *** copied to object a2.py in bucket ***

Can you guys confirm if you can reproduce it with the code in master?

@kurtqq
Contributor

kurtqq commented Apr 14, 2019

Can't reproduce either.

@mik-laj mik-laj added the provider:google Google (including GCP) related issues label Jul 22, 2019
@mik-laj
Member

mik-laj commented Aug 13, 2019

It seems to me that this change is no longer applicable, so I am closing this PR. If I am wrong, please reopen it.

@mik-laj mik-laj closed this Aug 13, 2019
