While trying to understand a failing test in #12811 (comment), I noticed that the write_dataset function does not actually always raise an error by default if there is already existing data in the target location.
The documentation says it will raise "if any data exists in the destination" (which is also what I would expect), but in practice it seems that it does ignore certain file names:
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({'a': [1, 2, 3]})

# write a first time to new directory: OK
>>> ds.write_dataset(table, "test_overwrite", format="parquet")
>>> !ls test_overwrite
part-0.parquet

# write a second time to the same directory: passes, but should raise?
>>> ds.write_dataset(table, "test_overwrite", format="parquet")
>>> !ls test_overwrite
part-0.parquet

# write another time to the same directory with different name: still passes
>>> ds.write_dataset(table, "test_overwrite", format="parquet", basename_template="data-{i}.parquet")
>>> !ls test_overwrite
data-0.parquet  part-0.parquet

# now writing again finally raises an error
>>> ds.write_dataset(table, "test_overwrite", format="parquet")
...
ArrowInvalid: Could not write to test_overwrite as the directory is not empty and existing_data_behavior is to error
So when checking for existing data, it seems to ignore any files that match the basename template pattern.
cc @westonpace do you know if this was intentional? (I would find that a strange corner case, and in any case it is also not documented)
Joris Van den Bossche / @jorisvandenbossche:
Strange thing is that the function that checks for this and potentially raises that error doesn't have any such logic:
Joris Van den Bossche / @jorisvandenbossche:
Ah, it's not because of the existing file matching the template, but because of only a single file being present: the maybe_files->size() > 1 should be >=1 instead.
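To make the off-by-one concrete, here is a minimal Python sketch that models only the comparison described above. It is not Arrow's actual C++ code: the function names, the `touch` helper and the use of `os.listdir` are assumptions made for illustration, standing in for `maybe_files->size() > 1` versus `>= 1`.

```python
import os
import tempfile


def directory_blocks_write_buggy(path):
    """Hypothetical model of the reported check: `size() > 1`."""
    return len(os.listdir(path)) > 1


def directory_blocks_write_fixed(path):
    """Hypothetical model of the proposed check: `size() >= 1`."""
    return len(os.listdir(path)) >= 1


def touch(directory, name):
    """Create an empty file so the directory gains one entry."""
    with open(os.path.join(directory, name), "w"):
        pass


def verdicts(path):
    """Return (buggy, fixed) verdicts for the current directory contents."""
    return (directory_blocks_write_buggy(path), directory_blocks_write_fixed(path))


results = []
with tempfile.TemporaryDirectory() as target:
    results.append(verdicts(target))   # empty directory
    touch(target, "part-0.parquet")
    results.append(verdicts(target))   # one existing file
    touch(target, "data-0.parquet")
    results.append(verdicts(target))   # two existing files

for count, (buggy, fixed) in enumerate(results):
    print(f"{count} file(s): buggy check blocks={buggy}, fixed check blocks={fixed}")
```

With a single file present, the `> 1` comparison lets the write through while `>= 1` blocks it, which matches the sequence in the reproduction: the error only appeared once two files were in the directory.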
Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Joris Van den Bossche / @jorisvandenbossche
Note: This issue was originally created as ARROW-16204. Please see the migration documentation for further details.