GH-34775: [R] arrow_table: as.data.frame() sometimes returns a tbl and sometimes a data.frame #35173

Merged: 19 commits, May 3, 2023

@thisisnic thisisnic commented Apr 17, 2023

Features of this PR:

  • Ensures that calling as.data.frame() on Arrow objects returns base R data.frame objects.

  • Drops the class attribute metadata of input objects whose only class is data.frame (i.e. objects that don't inherit from any additional classes). This sacrifices roundtrip class fidelity for data.frame objects: if we input a base R data.frame, convert it to an Arrow Table, and then convert it back to R, we get a tibble. However, we now have consistency in the type of returned objects, retain roundtrip fidelity for other (non-class) metadata, and guarantee that as.data.frame() returns a base R data.frame. Users who wish to input and return a data.frame object can call as.data.frame() on the returned object.

  • Implements dplyr::collect() for StructArrays so that these objects can still be returned as tibbles if needed.

  • Renames expect_data_frame() to expect_equal_data_frame() for clarity, and updates it to convert both the object and expected object to data.frames.

  • Closes: [R] arrow_table: as.data.frame() sometimes returns a tbl and sometimes a data.frame #34775
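
The behaviour described in the list above can be sketched roughly as follows. This is a minimal illustration rather than code from the PR: it assumes a version of the arrow R package that includes this change (the milestone listed on this PR is 13.0.0) and that dplyr is installed. The variable names (`df`, `tab`, `base_df`, `tbl`) are made up for the example.

```r
library(arrow)
library(dplyr)

# A plain base R data.frame as input (illustrative data)
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Convert to an Arrow Table
tab <- arrow_table(df)

# as.data.frame() is intended to always give back a base R data.frame
base_df <- as.data.frame(tab)
stopifnot(identical(class(base_df), "data.frame"))

# dplyr::collect() is the route to a tibble
tbl <- collect(tab)
stopifnot(inherits(tbl, "tbl_df"))
```

If a base R data.frame is wanted at the end of a pipeline that uses collect(), wrapping the result in as.data.frame() should give one.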

@thisisnic thisisnic marked this pull request as ready for review April 25, 2023 12:54
@paleolimbot paleolimbot left a comment

This looks good! It seems like a cleaner solution than what we currently have. I like the idea of dropping metadata on the way in where possible because I seem to remember that we can skip some calls from C++ into R if there is no metadata to restore which speeds things up a bit.

@github-actions github-actions bot added the "awaiting merge" label and removed the "awaiting committer review" label May 3, 2023
@thisisnic thisisnic merged commit 205ceb9 into apache:main May 3, 2023
12 of 13 checks passed
ursabot commented May 4, 2023

Benchmark runs are scheduled for baseline = 2ee0345 and contender = 205ceb9. 205ceb9 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️2.27% ⬆️0.06%] test-mac-arm
[Finished ⬇️3.32% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.66% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 205ceb99 ec2-t3-xlarge-us-east-2
[Finished] 205ceb99 test-mac-arm
[Finished] 205ceb99 ursa-i9-9960x
[Finished] 205ceb99 ursa-thinkcentre-m75q
[Finished] 2ee03450 ec2-t3-xlarge-us-east-2
[Finished] 2ee03450 test-mac-arm
[Finished] 2ee03450 ursa-i9-9960x
[Finished] 2ee03450 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot commented May 4, 2023

['Python', 'R'] benchmarks have high level of regressions.
test-mac-arm

liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this pull request May 11, 2023
…tbl and sometimes a data.frame (apache#35173)
Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this pull request May 15, 2023
…tbl and sometimes a data.frame (apache#35173)
rtpsw pushed a commit to rtpsw/arrow that referenced this pull request May 16, 2023
…tbl and sometimes a data.frame (apache#35173)
@thisisnic thisisnic modified the milestone: 13.0.0 Aug 3, 2023
@assignUser assignUser added the "Breaking Change" label (includes a breaking change to the API) Aug 7, 2023
@ianmcook ianmcook mentioned this pull request Sep 6, 2023
Labels: awaiting merge, Breaking Change, Component: Parquet, Component: R