[C++] Enable CSV Writer to append / overwrite existing file #30429

Open
Tracked by #30427
asfimport opened this issue Nov 29, 2021 · 7 comments

asfimport commented Nov 29, 2021

This would match the append argument of readr::write_csv(): a boolean. If FALSE, any existing file is overwritten; if TRUE, output is appended to the existing file. In both cases, a new file is created if the file doesn't exist.

Reporter: Dragoș Moldovan-Grünfeld / @dragosmg

Note: This issue was originally created as ARROW-14904. Please see the migration documentation for further details.
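For concreteness, the requested semantics can be approximated with today's Arrow C++ APIs roughly as follows. This is a sketch only: the WriteCsv helper is hypothetical (not an existing Arrow function), the header-skipping choice is an assumption, and, as discussed below, OpenAppendStream is not supported on every filesystem.

```cpp
#include <arrow/api.h>
#include <arrow/csv/api.h>
#include <arrow/filesystem/api.h>
#include <arrow/io/api.h>

// Hypothetical helper mirroring readr's `append` semantics.
// Overwrite -> OpenOutputStream (truncates or creates the file);
// append    -> OpenAppendStream (creates the file if it doesn't exist,
//              but is not implemented on some filesystems, e.g. S3).
arrow::Status WriteCsv(const arrow::Table& table, const std::string& path,
                       bool append) {
  auto fs = std::make_shared<arrow::fs::LocalFileSystem>();
  std::shared_ptr<arrow::io::OutputStream> sink;
  ARROW_ASSIGN_OR_RAISE(
      sink, append ? fs->OpenAppendStream(path) : fs->OpenOutputStream(path));
  auto options = arrow::csv::WriteOptions::Defaults();
  // Naive assumption: skip the header row when appending. A real
  // implementation would need to check whether the file already has contents.
  options.include_header = !append;
  ARROW_RETURN_NOT_OK(arrow::csv::WriteCSV(table, options, sink.get()));
  return sink->Close();
}
```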


Weston Pace / @westonpace:
Opening an append stream is slated for deprecation. It isn't supported on all filesystems (e.g. S3) or all formats (e.g. Parquet). Overwrite behavior should already exist today.

That being said, I can see the potential advantages when it comes to CSV. CC @pitrou


Antoine Pitrou / @pitrou:
Do we have to emulate everything that's provided by another R library? Or is there a real-world use case for this?


Dragoș Moldovan-Grünfeld / @dragosmg:
@pitrou I don't think we do. There is functionality we're not emulating (for example, the num_threads argument in readr::write_csv()), mostly because multithreading is implemented differently in arrow. On the real-world use: I've definitely used the readr::write_csv(..., append = TRUE) functionality in the past.

Any thoughts @nealrichardson, @ianmcook, @jonkeane?


Neal Richardson / @nealrichardson:
Apologies if I'm reopening a debate that's been settled, but just because appending isn't supported on all filesystems, why does that mean we can't allow it on filesystems where it is supported?

> Do we have to emulate everything that's provided by another R library?

Absolutely not, and likewise with pandas or any other package. But for every feature they have, there's a reason it exists, and we should evaluate whether it seems like a good reason, or at least decide to wait until someone asks for it.


Antoine Pitrou / @pitrou:

> Apologies if I'm reopening a debate that's been settled, but just because appending isn't supported on all filesystems, why does that mean we can't allow it on filesystems where it is supported?

We can, it's just that it's less useful :-) Nobody opposed when we started deprecating it, but we can un-deprecate if desired.


Neal Richardson / @nealrichardson:
I think we just identified a use :)


Weston Pace / @westonpace:
I'm a little bit torn here. Append is definitely something that users want. It is asked for a lot [1][2][3] (this is just a sample; there are at least 5 variations of "how do I append to parquet", and some on the mailing list too).

But the answer is very confusing to users. The parquet format page has confused many people with this line:

> The format is explicitly designed to separate the metadata from the data. This allows splitting columns into multiple files, as well as having a single metadata file reference multiple parquet files.

Spark further confuses the picture with "SaveMode.Append", which is documented as:

> Append mode means that when saving a DataFrame to a data source, if data/table already exists, contents of the DataFrame are expected to be appended to existing data.

But... what is actually happening is that it either reads in the file and rewrites it, or creates a new file in the same "dataset" (I don't recall off the top of my head which of the two it is).

So it has been useful for me to be able to parrot a simple line: "No, you cannot append to an existing file. The preferred operation is to create a new file in the same dataset. If you are doing many small writes, you can concatenate them in memory, or you can periodically merge files after they are written."
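For readers landing here, the "concatenate in memory, then write once" alternative could look roughly like this in Arrow C++. A sketch only: the WriteMerged helper is made up for illustration, and all pieces are assumed to share the same schema.

```cpp
#include <arrow/api.h>
#include <arrow/csv/api.h>
#include <arrow/io/api.h>

#include <memory>
#include <string>
#include <vector>

// Hypothetical helper: instead of appending to an existing CSV file,
// merge the in-memory pieces and write a single file in one shot.
arrow::Status WriteMerged(
    const std::vector<std::shared_ptr<arrow::Table>>& pieces,
    const std::string& path) {
  // All pieces must share the same schema for ConcatenateTables to succeed.
  ARROW_ASSIGN_OR_RAISE(auto merged, arrow::ConcatenateTables(pieces));
  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
  ARROW_RETURN_NOT_OK(arrow::csv::WriteCSV(
      *merged, arrow::csv::WriteOptions::Defaults(), sink.get()));
  return sink->Close();
}
```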

So I guess I worry about the slippery slope. "Users might sometimes want to append data, so let's add that to the filesystem" leads to "Users want to be able to append to CSV files", which leads to "We should add an append mode to write_dataset since there is at least one format that supports it", which leads to further confusing users.

I won't stand in the way of adding append to CSV if it's wanted, but I would be pretty stubborn about adding append to write_dataset.

[1] https://stackoverflow.com/questions/44608076/can-you-append-to-a-feather-format
[2] https://stackoverflow.com/questions/47113813/using-pyarrow-how-do-you-append-to-parquet-file
[3] https://stackoverflow.com/questions/38793170/appending-to-orc-file
