[R] Discussion: tibble dependency in R package #16714

Closed
asfimport opened this issue Apr 21, 2019 · 2 comments
Comments

@asfimport

Hello,

I would like to have a discussion on the use of tibble in the Apache Arrow R package. I looked at the project contributor guidelines (https://github.com/apache/arrow/blob/master/docs/source/developers/contributing.rst) and could not tell where the best place might be to start a public discussion on this topic, so I decided on JIRA. I apologize if this is not the right place.

TL;DR

I would like to propose moving the tibble dependency in the arrow R package to "Suggests", removing the as_tibble() call in read_arrow(), and having the core R code that implements the Arrow API return only data.frames or other base-R data structures wherever possible.

Reasoning

As far as I can tell (https://github.com/apache/arrow/search?p=1&q=tibble&unscoped_q=tibble), outside of tests and examples, tibble is only used in three places in the package:

  • S3 methods to convert Arrow objects to tibbles (as_tibble.arrow::RecordBatch(), as.tibble.arrow::Table())

  • optional "convert to tibble on the way out" behavior controlled by a flag in interfaces to file types (parquet and feather)

  • read_arrow() (https://github.com/apache/arrow/blob/0536ef8174982a7a13a251174cc38701e8663b68/r/R/read_table.R#L88)

In my opinion, all three of these uses of tibble are valuable for developers who use that package (or other packages in its ecosystem), but I am not convinced that the Arrow R package should be tightly coupled to them.

In the Python community, pandas is a broadly agreed-upon standard for representing data frames. Even with that ubiquity, pyarrow does not depend on pandas (pandas is not required in order to work with pyarrow), and all "compatibility with pandas" code is isolated in a place explicitly intended for that purpose: https://github.com/apache/arrow/blob/master/python/pyarrow/pandas_compat.py

I think that is the ideal way to handle integration between Arrow and the other software it might be used with. It means users who care about only one of the integrations (e.g. feather, parquet, HDFS, Apache Spark, tibble, data.table, etc.) only have to build the things they are already using.
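As a minimal sketch of the pattern proposed here, assuming tibble is listed under "Suggests" rather than "Imports" in the package DESCRIPTION: the conversion step below returns a base-R data.frame by default and only touches tibble when the caller explicitly asks for it. The helper name `maybe_as_tibble()` and its `as_tibble` argument are hypothetical, chosen for illustration; this is not the arrow package's actual code.

```r
# Hypothetical helper, not part of the arrow package.
# Assumes tibble is listed under "Suggests" (not "Imports") in DESCRIPTION.
maybe_as_tibble <- function(df, as_tibble = FALSE) {
  stopifnot(is.data.frame(df))
  if (!as_tibble) {
    return(df)  # default path uses base R only
  }
  # tibble is loaded only when the caller explicitly requests it
  if (!requireNamespace("tibble", quietly = TRUE)) {
    stop("as_tibble = TRUE requires the 'tibble' package; ",
         "install it or call with as_tibble = FALSE.", call. = FALSE)
  }
  tibble::as_tibble(df)
}

# Stand-in for data decoded from an Arrow source
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
plain <- maybe_as_tibble(df)
class(plain)
```

With this shape, a user who never sets `as_tibble = TRUE` never needs tibble installed, so a breaking tibble release cannot affect the default read path.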

     

Other background information

I took the time to write this tonight after talking a colleague through the issues feather (R package) users experienced after the tibble 2.0 release. See for example wesm/feather#374 (https://github.com/wesm/feather/issues/374) and wesm/feather#372 (https://github.com/wesm/feather/issues/372). When tibble 2.0 came out it broke feather 0.3.1, and the maintainers there promptly released feather 0.3.2 to CRAN, which was compatible with tibble 2.0+. Unfortunately, this still caused disruptions for many people using feather (who inadvertently had tibble upgraded as part of installing other packages which depended on it). Nothing about tibble was necessary to the implementation of read_feather(), as far as I can tell, but this design choice made installing and upgrading tibble non-optional for developers who just wanted to use the feather file format and all its awesome features.

If the proposal here is accepted, I hope it will mean we can avoid repeating the same experience with the R arrow package and set a strong precedent for developers who want to add compatibility in this package for other members of the ecosystem like parquet or Apache Spark.

Thank you for hearing me out!

Reporter: James Lamb / @jameslamb
Assignee: Romain Francois / @romainfrancois

PRs and other links:

Note: This issue was originally created as ARROW-5190. Please see the migration documentation for further details.

@asfimport

James Lamb / @jameslamb:
Thanks @romainfrancois!!!

@asfimport

Romain Francois / @romainfrancois:
Issue resolved by pull request 4454
#4454

@asfimport asfimport added this to the 0.14.0 milestone Jan 11, 2023