[AIRFLOW- boyscout] Enforce delimiter for gcs_to_gcs operator using a flag, enforce_delimiter#4619
icaroNZ wants to merge 1 commit into apache:master
-1 for this one as we should have the same options as in the GCS API. Using correct combination of
The delimiter is passed to gcs_hook, but its value is never used, as per the explanation above. It took us a long time to notice that compressed files were being copied along with the CSV files. When we use *.csv, the delimiter .csv is never applied, so anything after the * was copied, including the .gz and .csv.gz files.
You can't (or probably shouldn't) use multiple wildcards. We already have this in our docstring, and Google has also listed it in their API docs.
I am not talking about multiple wildcards; I am talking about just one.
OK. This needs to be tested, as it was working perfectly fine. I will try to spend some time over the weekend looking into this and will let you know. Thanks for the contribution, @icaroNZ.
If the behavior is as shown in the example, this is indeed a bug. @kaxil, did you have time to check this?
@icaroNZ @kurtqq I am not able to reproduce this with the code in master. Following is the test I conducted: Files inside my test bucket: DAG: Logs: When I run change source_object to Can you guys confirm if you can reproduce it with the code in master?
Can't reproduce either.
It seems to me that this change is no longer applicable, so I am closing this PR. If I am wrong, please reopen it.
Problem now:
Given the files: test1.csv, test2.csv, test10.csv, test100.csv, test1.gz, test2.gz, test10.gz, test100.gz
When trying to match test*.csv
Result: all of the files above match
Fix:
Given the files: test1.csv, test2.csv, test10.csv, test100.csv, test1.gz, test2.gz, test10.gz, test100.gz
When trying to match test*.csv
Result: only the files test1.csv, test2.csv, test10.csv, test100.csv match
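The difference between the two results above can be illustrated with a minimal sketch in plain Python. This is not the operator's or the hook's actual code: `match_wildcard` and `FILES` are hypothetical names, and the sketch assumes that the part of the pattern before the `*` is used as a prefix and the part after it as the delimiter (suffix).

```python
# Hypothetical illustration; not the code of the gcs_to_gcs operator or GCS hook.
FILES = [
    "test1.csv", "test2.csv", "test10.csv", "test100.csv",
    "test1.gz", "test2.gz", "test10.gz", "test100.gz",
]


def match_wildcard(names, pattern, enforce_delimiter=False):
    """Match names against a pattern containing a single '*' wildcard.

    The text before '*' is treated as a prefix. The text after '*' is
    only checked when enforce_delimiter is True.
    """
    prefix, _, suffix = pattern.partition("*")
    matched = [name for name in names if name.startswith(prefix)]
    if enforce_delimiter:
        # Keep only names that also end with the part after the wildcard.
        matched = [name for name in matched if name.endswith(suffix)]
    return matched


# "Problem now": the suffix is ignored, so the .gz files match as well.
prefix_only = match_wildcard(FILES, "test*.csv")

# "Fix": the suffix is enforced, so only the .csv files remain.
enforced = match_wildcard(FILES, "test*.csv", enforce_delimiter=True)
```

With the flag left at its default, `prefix_only` contains all eight names; with the flag set, `enforced` contains only the four `.csv` names.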
Problem that remains in the code: when using multiple wildcards, the 'middle part' of the pattern is not enforced:
Given the files: testProd1.csv, test2Prod.csv, testProd10.csv, testProd100.csv, testProd1.gz, test2Prod.gz, test10Prod.gz, test100Prod.gz, in directory dir1 and dir2
When trying to match /testAcceptance.csv
Result: all of the files above match
Expect: No files should be returned
The enforce_delimiter flag defaults to False. When it is set to False or left unset, the current behavior of the operator does not change.
When set to True, it uses a new hook, list_with_delimiter; in this hook, the value after the last wildcard '*' is enforced.
Note that this PR fixes only the problem of enforcing the last part of the path; the middle part stays as it is, as described above.
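To make the remaining limitation concrete, here is a rough sketch that contrasts "enforce only the part after the last wildcard" with matching every part of the pattern. It is an illustration only, not the code of list_with_delimiter: the helper names, the sample names, and the pattern `test*Acceptance*.csv` are assumptions chosen for the example, and the standard-library `fnmatch` module is swapped in as a stand-in for a matcher that checks the middle part.

```python
from fnmatch import fnmatchcase

# Sample object names; none of them contains "Acceptance".
NAMES = ["testProd1.csv", "test2Prod.csv", "testProd10.csv", "testProd1.gz"]

# Assumed multi-wildcard pattern, used for illustration only.
PATTERN = "test*Acceptance*.csv"


def last_part_only(names, pattern):
    """Check the text before the first '*' and after the last '*' only."""
    parts = pattern.split("*")
    prefix, suffix = parts[0], parts[-1]
    return [n for n in names if n.startswith(prefix) and n.endswith(suffix)]


def every_part(names, pattern):
    """Check the whole pattern, including the middle part, via fnmatch."""
    return [n for n in names if fnmatchcase(n, pattern)]


# The middle part "Acceptance" is ignored, so the .csv names still match.
loose = last_part_only(NAMES, PATTERN)

# The middle part is enforced, so nothing matches.
strict = every_part(NAMES, PATTERN)
```

Here `loose` still contains the three `.csv` names, while `strict` is empty. Note that `fnmatch` also gives special meaning to `?` and `[...]`, so it is not a drop-in replacement for the single-wildcard behavior described in this PR.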