Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Dataset] Default error existing_data_behaviour for writing dataset ignores a single file #20193

Closed
asfimport opened this issue Apr 15, 2022 · 5 comments

Comments

@asfimport
Copy link

While trying to understand a failing test in #12811 (comment), I noticed that the write_dataset function does not actually always raise an error by default if there is already existing data in the target location.

The documentation says it will raise "if any data exists in the destination" (which is also what I would expect), but in practice it seems that it does ignore certain file names:

import pyarrow.dataset as ds
table = pa.table({'a': [1, 2, 3]})

# write a first time to new directory: OK
>>> ds.write_dataset(table, "test_overwrite", format="parquet")
>>> !ls test_overwrite
part-0.parquet

# write a second time to the same directory: passes, but should raise?
>>> ds.write_dataset(table, "test_overwrite", format="parquet")
>>> !ls test_overwrite
part-0.parquet

# write a another time to the same directory with different name: still passes
>>> ds.write_dataset(table, "test_overwrite", format="parquet", basename_template="data-{i}.parquet")
>>> !ls test_overwrite
data-0.parquet	part-0.parquet

# now writing again finally raises an error
>>> ds.write_dataset(table, "test_overwrite", format="parquet")
...
ArrowInvalid: Could not write to test_overwrite as the directory is not empty and existing_data_behavior is to error

So it seems that when checking if existing data exists, it seems to ignore any files that match the basename template pattern.

cc @westonpace do you know if this was intentional? (I would find that a strange corner case, and in any case it is also not documented)

Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Joris Van den Bossche / @jorisvandenbossche

PRs and other links:

Note: This issue was originally created as ARROW-16204. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Joris Van den Bossche / @jorisvandenbossche:
Strange thing is that the function that checks for this and potentially raises that error doesn't have any such logic:

Status EnsureDestinationValid(const FileSystemDatasetWriteOptions& options) {
if (options.existing_data_behavior == ExistingDataBehavior::kError) {
fs::FileSelector selector;
selector.base_dir = options.base_dir;
selector.recursive = true;
Result<std::vector<fs::FileInfo>> maybe_files =
options.filesystem->GetFileInfo(selector);
if (!maybe_files.ok()) {
// If the path doesn't exist then continue
return Status::OK();
}
if (maybe_files->size() > 1) {
return Status::Invalid(
"Could not write to ", options.base_dir,
" as the directory is not empty and existing_data_behavior is to error");
}
}
return Status::OK();
}

@asfimport
Copy link
Author

Joris Van den Bossche / @jorisvandenbossche:
Ah, it's not because of the existing file matching the template, but because of only a single file being present: the maybe_files->size() > 1 should be >=1 instead.

@asfimport
Copy link
Author

Krisztian Szucs / @kszucs:
Postponing to 9.0

@asfimport
Copy link
Author

Joris Van den Bossche / @jorisvandenbossche:
This is a tiny (but useful) bug fix with a PR that is ready, so I added it back to 8.0 milestone

@asfimport
Copy link
Author

Joris Van den Bossche / @jorisvandenbossche:
Issue resolved by pull request 12898
#12898

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants