Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R] Rename read_ipc_file to read_arrow_file & highlight arrow over feather #20472

Open
asfimport opened this issue Oct 25, 2022 · 10 comments
Open

Comments

@asfimport
Copy link

Following up from this mailing list conversation, I am wondering if the R package should rename read_ipc_file() / write_ipc_file()toread_arrow_file()/ write_arrow_file()`, or add an additional alias for both. It might also be helpful to update the documentation so that users read "Write an Arrow file (formerly known as a Feather file)" rather than the current Feather-named first approach, assuming there is a community decision to coalesce around the name Arrow for the file format, and the project is moving on from the name Feather.

Reporter: Stephanie Hazlitt / @stephhazlitt

Note: This issue was originally created as ARROW-18148. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Nicola Crane / @thisisnic:
+1 to doing this, have added a comment on the community discussion as it seemed to be moving towards a conclusion but not there entirely

@asfimport
Copy link
Author

Stephanie Hazlitt / @stephhazlitt:
The other {}arrow::read_*(){}functions don't use the _file in the name, just {}read_ipc_file(){}, which I am guessing is because we also have {}read_ipc_stream(){}. I wonder if read_arrow and read_arrow_stream would work and be more consistent with the naming API?

@asfimport
Copy link
Author

Neal Richardson / @nealrichardson:
It's tricky because the Arrow IPC format has a "file" and a "stream" specification--but either can be written to a file.

@asfimport
Copy link
Author

Dewey Dunnington / @paleolimbot:
You can sniff which one it is from the first 8 bytes though (ARROW1 plus some zeros vs 0xFFFFFFFF or something)...I don't think we currently do that.

@asfimport
Copy link
Author

Nicola Crane / @thisisnic:
Just pinging @djnavarro as she's doing work updating our vignettes, and I'd be keen to hear her views on how best we approach this - we've touched on this a bit already on https://github.com/apache/arrow/pull/14514.

@asfimport
Copy link
Author

Danielle Navarro / @djnavarro:
Tentatively offering some thoughts :-)

If I'm understanding this properly, we have two problems:

  • The first problem is that the history of serializing Arrow objects is messy and has left us with three names that people might recognize: Feather, IPC, Arrow. We'd like users to transition to using "Arrow" as the preferred name, and to give them an API that reflects that terminology.

  • The second problem is that we use "file format" and "stream format" to mean something subtly different from "files" and "streams". The file format wraps the stream format with magic numbers at the start and end, with a footer written after the stream. These two formats aren't inherently tied to files and streams. The user can write a "stream formatted" file if they want (i.e., no magic numbers, no footers) and they can also send a "file formatted" serialization (i.e., with the magic number and footer) to an output stream if they want to. The current API allows this, but users would be forgiven for missing this subtle detail!

    Option 1: Don't change the API, only the docs

    This option would leave read_ipc_file(), write_ipc_file(), read_ipc_stream(), and write_ipc_stream() as the four user-facing functions (treating read_feather() and write_feather() as soft-deprecated, and leaving write_to_raw() untouched)

    The only thing that would change in this version is that we would consistently refer to "Arrow IPC file" and "Arrow IPC stream" everywhere (i.e., never truncating it to "IPC"). Language around "feather" would be relegated to a secondary position (e.g., "formerly known as Feather"), and we would emphasize that the preferred file extension for V2 feather files is .arrow

    Option 2: New names for the existing four functions

    This option would replace read_ipc_file() with read_arrow_file(), read_ipc_stream() with read_arrow_stream() and so on. The ipc and feather versions would be soft-deprecated.  

    The documentation would be updated accordingly. We'd now refer to "Arrow file" and "Arrow stream" everywhere. As with option 1 we'd use language like "formerly known as Feather" to explain the history (perhaps linking back to the old repo just to highlight the origin). We would also, where relevant, note that "Arrow stream" is a conventional name for the "Arrow inter-process communication (IPC) streaming format", as a way of (a) explaining the ipc versions of the functions, and (b) helping users find the relevant part of the Arrow specification.

    Option 3: Reduce API to two functions

    This option would have only two functions, read_arrow() and write_arrow(). Both functions would have a new argument called format (or something similar). Users could specify either format = "stream" or format = "file". From a documentation perspective this would require a little more finessing: we might have to have separate the help topics for the new API and older versions of API to avoid mess. But it might have the advantage of making clearer to users that the terms "stream" and "file" don't actually refer to where you're writing the data, but how you encode the data when you write it. 

    Preferences?

    I am not sure what I prefer, but I can at least say what I think the strengths and weaknesses are for each proposal:

    Option 3 seems like the cleanest in terms of making the Arrow/Feather/IPC functions feel analogous to the other functions in the read/write API: read_arrow() and write_arrow() feels closely aligned with read_parquet() and write_parquet(). It makes very clear that these functions are designed to read and write Arrow objects in an "Arrow-like" way. However, it does have the disadvantage that the encoding vs destination complexity gets pushed into the arguments: users will need to understand why there is format argument that is distinct from the file/sink argument, and the documentation will need to explain that. 

    Option 2 has the advantage of preserving the same "four-function structure"" as the existing serialization API, but it does come at the expense of being a little misleading to anyone who doesn't understand that the function names refer to the encoding not the destination: write_arrow_stream() can in fact write to a file, and write_arrow_file() can write to a stream. That's potentially even more confusing. 

    Option 1 has the advantage of not confusing existing users. The API doesn't change, and the documentation becomes slightly more informative. The disadvantage is that it leaves new users a bit confused about what the heck an "IPC" is, which means the documentation will have to carry the load. 

    Additional documentation thoughts

    Regardless of what option we go with, I'll write the user-facing vignettes to use only the newest version of the API, especially in the arrow.Rmd vignette and the read_write.Rmd vignette where new users are most likely to run across these concepts. In those places I would try my best not to dive into too much detail, because it's a complexity that new users don't need. 

    The question that arises is "where do we talk about the nuance?" To some extent I think we could move some of that to the "details" section of various help topics, but... (and I hate saying this)... it might make sense to write an "Arrow serialization" vignette that would be loosely analogous to the "Data object layout" vignette that I'm proposing to introduce in ARROW-17887: [R][Doc] Improve readability of the Get Started and README pages #14514. On the documentation page it would be grouped in with the developer vignettes (to signal that it's advanced content), but just like I'm doing with "Data object layout", I'll cross reference it from the user-facing vignettes. For instance, in the section on reading and writing arrow (formerly feather) files, there would be a short paragraph that hints at these issues, and then links the user to the serialization vignette where all the detail is unpacked.

     

     

@asfimport
Copy link
Author

Stephanie Hazlitt / @stephhazlitt:

"where do we talk about the nuance"
One approach when making a big design change is to have a short vignette that explains the change itself (e.g. dbplyr 2.0 did this). What is proposed is not a breaking change, however if the package moves to having read_arrow(..., format = c("file", "stream", "auto")) it might be worth a 101-level page on the function naming history, given there was an early version of read_arrow()` which was deprecated, the marketing that needs to be done re: feather vs arrow naming and so on. This could also be done in the proposed Arrow serialization vignette with pointers, as suggested.

@asfimport
Copy link
Author

Nicola Crane / @thisisnic:
Thanks for taking a look at this in such close detail - interesting to see the detail there around "stream format" and "file format". I think I'm leaning towards option 3 seeming like the way to go. I'm not keen on pushing the term "IPC" on users who are otherwise unaware of it and don't need to be aware of it, and like the API in option 3.

Perhaps for now, disregard all of this in that PR you have open and those further docs updates can be made in a follow-up PR once the work to update these functions is done?

@asfimport
Copy link
Author

Danielle Navarro / @djnavarro:
Agreed. For the current PR I'll write it as though the API weren't changing, but will still preference the term "Arrow" over "Feather" where that's relevant

@Divyansh200102
Copy link
Contributor

take

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment