[C++] Enable CSV Writer to append / overwrite existing file #30429

Open
Tracked by #30427
asfimport opened this issue Nov 29, 2021 · 7 comments

asfimport commented Nov 29, 2021

This would match the append argument of readr::write_csv(): a boolean. If FALSE, any existing file is overwritten; if TRUE, output is appended to the existing file. In both cases, a new file is created if the file doesn't exist.

Reporter: Dragoș Moldovan-Grünfeld / @dragosmg

Note: This issue was originally created as ARROW-14904. Please see the migration documentation for further details.
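For concreteness, the requested semantics can be approximated with today's Arrow C++ APIs roughly as follows. This is a sketch only: the WriteCsv helper is hypothetical (not an existing Arrow function), the header-skipping choice is an assumption, and, as discussed below, OpenAppendStream is not supported on every filesystem.

```cpp
#include <arrow/api.h>
#include <arrow/csv/api.h>
#include <arrow/filesystem/api.h>
#include <arrow/io/api.h>

// Hypothetical helper mirroring readr's `append` semantics.
// Overwrite -> OpenOutputStream (truncates or creates the file);
// append    -> OpenAppendStream (creates the file if it doesn't exist,
//              but is not implemented on some filesystems, e.g. S3).
arrow::Status WriteCsv(const arrow::Table& table, const std::string& path,
                       bool append) {
  auto fs = std::make_shared<arrow::fs::LocalFileSystem>();
  std::shared_ptr<arrow::io::OutputStream> sink;
  ARROW_ASSIGN_OR_RAISE(
      sink, append ? fs->OpenAppendStream(path) : fs->OpenOutputStream(path));
  auto options = arrow::csv::WriteOptions::Defaults();
  // Naive assumption: skip the header row when appending. A real
  // implementation would need to check whether the file already has contents.
  options.include_header = !append;
  ARROW_RETURN_NOT_OK(arrow::csv::WriteCSV(table, options, sink.get()));
  return sink->Close();
}
```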


Weston Pace / @westonpace:
Opening an append stream is slated for deprecation. It isn't supported on all filesystems (e.g. S3) or all formats (e.g. Parquet). Overwrite behavior should already exist today.

That being said, I can see the potential advantages when it comes to CSV. CC @pitrou


Antoine Pitrou / @pitrou:
Do we have to emulate everything that's provided by another R library? Or is there a real-world use case for this?


Dragoș Moldovan-Grünfeld / @dragosmg:
@pitrou I don't think we do. There is functionality we're not emulating (for example, the num_threads argument in readr::write_csv()), mostly because multithreading is implemented differently in arrow. On the real-world use: I've definitely used the readr::write_csv(..., append = TRUE) functionality in the past.

Any thoughts @nealrichardson, @ianmcook, @jonkeane?


Neal Richardson / @nealrichardson:
Apologies if I'm reopening a debate that's been settled, but just because appending isn't supported on all filesystems, why does that mean we can't allow it on filesystems where it is supported?

> Do we have to emulate everything that's provided by another R library?

Absolutely not, and likewise with pandas or any other package. But for every feature they have, there's a reason it exists, and we should evaluate whether it seems like a good reason, or at least decide to wait until someone asks for it.


Antoine Pitrou / @pitrou:

> Apologies if I'm reopening a debate that's been settled, but just because appending isn't supported on all filesystems, why does that mean we can't allow it on filesystems where it is supported?

We can, it's just that it's less useful :-) Nobody opposed when we started deprecating it, but we can un-deprecate if desired.


Neal Richardson / @nealrichardson:
I think we just identified a use :)


Weston Pace / @westonpace:
I'm a little bit torn here. Append is definitely something that users want. It is asked for a lot [1][2][3] (this is just a sample; there are at least 5 variations of "how do I append to parquet", and some on the mailing list too).

But the answer is very confusing to users. The parquet format page has confused many people with this line:

> The format is explicitly designed to separate the metadata from the data. This allows splitting columns into multiple files, as well as having a single metadata file reference multiple parquet files.

Spark further confuses the picture with "SaveMode.Append", which is documented as:

> Append mode means that when saving a DataFrame to a data source, if data/table already exists, contents of the DataFrame are expected to be appended to existing data.

But... what is actually happening is that it either reads in the file and rewrites it, or creates a new file in the same "dataset" (I don't recall off the top of my head which of the two it is).

So it has been useful for me to be able to parrot a simple line: "No, you cannot append to an existing file. The preferred operation is to create a new file in the same dataset. If you are doing many small writes, you can concatenate them in memory, or you can periodically merge files after they are written."
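For readers landing here, the "concatenate in memory, then write once" alternative could look roughly like this in Arrow C++. A sketch only: the WriteMerged helper is made up for illustration, and all pieces are assumed to share the same schema.

```cpp
#include <arrow/api.h>
#include <arrow/csv/api.h>
#include <arrow/io/api.h>

#include <memory>
#include <string>
#include <vector>

// Hypothetical helper: instead of appending to an existing CSV file,
// merge the in-memory pieces and write a single file in one shot.
arrow::Status WriteMerged(
    const std::vector<std::shared_ptr<arrow::Table>>& pieces,
    const std::string& path) {
  // All pieces must share the same schema for ConcatenateTables to succeed.
  ARROW_ASSIGN_OR_RAISE(auto merged, arrow::ConcatenateTables(pieces));
  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
  ARROW_RETURN_NOT_OK(arrow::csv::WriteCSV(
      *merged, arrow::csv::WriteOptions::Defaults(), sink.get()));
  return sink->Close();
}
```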

So I guess I worry about the slippery slope. "Users might sometimes want to append data, so let's add that to the filesystem" leads to "Users want to be able to append to CSV files", which leads to "We should add an append mode to write_dataset since there is at least one format that supports it", which leads to further confusing users.

I won't stand in the way of adding append to CSV if it's wanted, but I would be pretty stubborn about adding append to write_dataset.

[1] https://stackoverflow.com/questions/44608076/can-you-append-to-a-feather-format
[2] https://stackoverflow.com/questions/47113813/using-pyarrow-how-do-you-append-to-parquet-file
[3] https://stackoverflow.com/questions/38793170/appending-to-orc-file
