[C++][Dataset] Default error existing_data_behaviour for writing dataset ignores a single file #20193

asfimport · 2022-04-15T08:48:16Z

While trying to understand a failing test in #12811 (comment), I noticed that the write_dataset function does not actually always raise an error by default if there is already existing data in the target location.

The documentation says it will raise "if any data exists in the destination" (which is also what I would expect), but in practice it seems that it does ignore certain file names:

import pyarrow.dataset as ds
table = pa.table({'a': [1, 2, 3]})

# write a first time to new directory: OK
>>> ds.write_dataset(table, "test_overwrite", format="parquet")
>>> !ls test_overwrite
part-0.parquet

# write a second time to the same directory: passes, but should raise?
>>> ds.write_dataset(table, "test_overwrite", format="parquet")
>>> !ls test_overwrite
part-0.parquet

# write a another time to the same directory with different name: still passes
>>> ds.write_dataset(table, "test_overwrite", format="parquet", basename_template="data-{i}.parquet")
>>> !ls test_overwrite
data-0.parquet	part-0.parquet

# now writing again finally raises an error
>>> ds.write_dataset(table, "test_overwrite", format="parquet")
...
ArrowInvalid: Could not write to test_overwrite as the directory is not empty and existing_data_behavior is to error

So it seems that when checking if existing data exists, it seems to ignore any files that match the basename template pattern.

cc @westonpace do you know if this was intentional? (I would find that a strange corner case, and in any case it is also not documented)

Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Joris Van den Bossche / @jorisvandenbossche

PRs and other links:

GitHub Pull Request #12898

_{Note: This issue was originally created as ARROW-16204. Please see the migration documentation for further details.}

The text was updated successfully, but these errors were encountered:

asfimport · 2022-04-15T08:56:10Z

Joris Van den Bossche / @jorisvandenbossche:
Strange thing is that the function that checks for this and potentially raises that error doesn't have any such logic:

arrow/cpp/src/arrow/dataset/dataset_writer.cc

Lines 413 to 431 in 63d2a9c

    
           Status EnsureDestinationValid(const FileSystemDatasetWriteOptions& options) { 
        
             if (options.existing_data_behavior == ExistingDataBehavior::kError) { 
        
               fs::FileSelector selector; 
        
               selector.base_dir = options.base_dir; 
        
               selector.recursive = true; 
        
               Result<std::vector<fs::FileInfo>> maybe_files = 
        
                   options.filesystem->GetFileInfo(selector); 
        
               if (!maybe_files.ok()) { 
        
                 // If the path doesn't exist then continue 
        
                 return Status::OK(); 
        
               } 
        
               if (maybe_files->size() > 1) { 
        
                 return Status::Invalid( 
        
                     "Could not write to ", options.base_dir, 
        
                     " as the directory is not empty and existing_data_behavior is to error"); 
        
               } 
        
             } 
        
             return Status::OK(); 
        
           }

asfimport · 2022-04-15T08:58:34Z

Joris Van den Bossche / @jorisvandenbossche:
Ah, it's not because of the existing file matching the template, but because of only a single file being present: the maybe_files->size() > 1 should be >=1 instead.

asfimport · 2022-04-21T18:56:31Z

Krisztian Szucs / @kszucs:
Postponing to 9.0

asfimport · 2022-04-22T07:23:10Z

Joris Van den Bossche / @jorisvandenbossche:
This is a tiny (but useful) bug fix with a PR that is ready, so I added it back to 8.0 milestone

asfimport · 2022-04-22T16:11:57Z

Joris Van den Bossche / @jorisvandenbossche:
Issue resolved by pull request 12898
#12898

asfimport closed this as completed Apr 22, 2022

asfimport assigned jorisvandenbossche Jan 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[C++][Dataset] Default error existing_data_behaviour for writing dataset ignores a single file #20193

[C++][Dataset] Default error existing_data_behaviour for writing dataset ignores a single file #20193

asfimport commented Apr 15, 2022

asfimport commented Apr 15, 2022

asfimport commented Apr 15, 2022

asfimport commented Apr 21, 2022

asfimport commented Apr 22, 2022

asfimport commented Apr 22, 2022

[C++][Dataset] Default error existing_data_behaviour for writing dataset ignores a single file #20193

[C++][Dataset] Default error existing_data_behaviour for writing dataset ignores a single file #20193

Comments

asfimport commented Apr 15, 2022

PRs and other links:

asfimport commented Apr 15, 2022

asfimport commented Apr 15, 2022

asfimport commented Apr 21, 2022

asfimport commented Apr 22, 2022

asfimport commented Apr 22, 2022