GH-37813: [R] add quoted_na argument to open_delim_dataset() #37828

thisisnic · 2023-09-22T01:37:03Z

Rationale for this change

The open_delim_dataset() family of functions was implemented to have the same arguments as the read_delim_arrow() functions where possible, but quoted_na was missed.

What changes are included in this PR?

Adding quoted_na to those functions.

Are these changes tested?

Yes

Are there any user-facing changes?

Yes

This PR includes breaking changes to public APIs.

Empty strings in input datasets are now read in as NA by default, rather than as empty strings, by open_delim_dataset() and its derivatives.
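For illustration, here is a minimal sketch of the behaviour change described above. It assumes an arrow version that includes this PR, plus dplyr for collect(). The scratch directory, the file name, and the variable names are hypothetical. The NA and "" results noted in the comments are what the description implies, not output captured from a run.

```r
library(arrow)
library(dplyr)

# Hypothetical scratch directory holding a single CSV file.
csv_dir <- tempfile("quoted_na_demo")
dir.create(csv_dir)

# The third data row has an empty `text` field.
writeLines(
  c("text,num", "one,1", "two,2", ",3", "four,4"),
  file.path(csv_dir, "data.csv")
)

# Default after this PR (quoted_na = TRUE): the empty string is expected to be read as NA.
default_read <- open_csv_dataset(csv_dir) %>% collect()

# Opting out (quoted_na = FALSE): the empty string is expected to be kept as "".
keep_empty <- open_csv_dataset(csv_dir, quoted_na = FALSE) %>% collect()

print(default_read$text)
print(keep_empty$text)
```

Code that relied on the previous behaviour can pass quoted_na = FALSE explicitly.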

Closes: [R] read_delim_arrow recognizes empty string "" as NA but open_dataset does not #37813

paleolimbot

Possibly a few opportunities for clarification but in general looks good to me!

paleolimbot · 2023-09-27T14:08:35Z

r/tests/testthat/test-dataset-csv.R

@@ -253,7 +253,7 @@ test_that("readr parse options", {
      tsv_dir,
      partitioning = "part",
      format = "text",
-      quo = "\"",
+      del = ","


Is this change intentional?

Yep. Passing in quo (not a valid parameter name) no longer constitutes a unique partial match to either the readr or the arrow options, so it is caught earlier in input validation and raises an "Unrecognized option" error instead of an "Ambiguous option" error. It would be nice to refactor this input validation in another PR, but here I figured we just want to replace quo with something that takes us down the "Ambiguous option" path, to stick with the spirit of the test.
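The partial-matching effect described in this reply can be sketched in base R with pmatch(). This is not arrow's actual validation code: the option vectors are abbreviated and illustrative, and classify() is a hypothetical helper that only mirrors the order of checks described above.

```r
# Abbreviated, illustrative option lists (assumptions, not arrow's full lists).
readr_before <- c("delim", "quote", "escape_double")
readr_after  <- c("delim", "quote", "quoted_na", "escape_double")
arrow_opts   <- c("delimiter", "quoting", "quote_char")

# Hypothetical helper: an option is "unrecognized" if it is not a unique
# partial match in either list, and "ambiguous" if it matches uniquely in
# at least one list but not in the combined list.
classify <- function(opt, readr_opts, arrow_opts) {
  in_readr <- !is.na(pmatch(opt, readr_opts))
  in_arrow <- !is.na(pmatch(opt, arrow_opts))
  if (!in_readr && !in_arrow) {
    return("unrecognized")
  }
  if (is.na(pmatch(opt, c(readr_opts, arrow_opts)))) {
    return("ambiguous")
  }
  "ok"
}

quo_before <- classify("quo", readr_before, arrow_opts) # "ambiguous"
quo_after  <- classify("quo", readr_after, arrow_opts)  # "unrecognized"
del_after  <- classify("del", readr_after, arrow_opts)  # "ambiguous"
```

Once quoted_na joins the readr-style names, quo partially matches both quote and quoted_na, so it stops being a unique match anywhere. del still matches delim and delimiter uniquely within their own lists, which keeps it on the ambiguous path.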

paleolimbot · 2023-09-27T15:02:34Z

r/tests/testthat/test-dataset-csv.R

+  df <- data.frame(text = c("one", "two", "", "four"), num = 1:4)
+  write.csv(df, dst_file, row.names = FALSE, quote = FALSE)


Is this the same as write("one\ntwo\nthree\n\nfour")? (It might be a tiny bit clearer what's actually getting tested to just write the contents of the file, or to put them in a comment. I forget what all the arguments of write.csv() actually do!)

Agreed, updated.
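For reference, a small base R sketch of what the write.csv() call in the diff puts on disk, compared with writing the same lines explicitly. The temporary file paths and variable names are hypothetical, and the explicit form is one possible simplification, not necessarily the exact code that was committed.

```r
df <- data.frame(text = c("one", "two", "", "four"), num = 1:4)

# What the original test wrote: header row, no row names, no quoting.
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE, quote = FALSE)
written <- readLines(csv_path)

# The same contents spelled out line by line.
explicit_path <- tempfile(fileext = ".csv")
writeLines(c("text,num", "one,1", "two,2", ",3", "four,4"), explicit_path)
explicit <- readLines(explicit_path)

identical(written, explicit) # TRUE
```

Spelling out the lines makes it visible that the row under test is `,3`, an unquoted empty text field.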

conbench-apache-arrow · 2023-09-30T02:45:54Z

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 72c6497.

There were 5 benchmark results indicating a performance regression:

Commit Run on ursa-i9-9960x at 2023-09-29 16:12:06Z
- tpch (R) with engine=arrow, format=native, language=R, memory_map=False, query_id=TPCH-16, scale_factor=1
- tpch (R) with engine=arrow, format=native, language=R, memory_map=False, query_id=TPCH-17, scale_factor=1
and 3 more (see the report linked below)

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

…pache#37828)

### Rationale for this change

The `open_delim_dataset()` family of functions were implemented to have the same arguments as the `read_delim_arrow()` functions where possible, but `quoted_na` was missed.

### What changes are included in this PR?

Adding `quoted_na` to those functions.

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes

**This PR includes breaking changes to public APIs.**

Empty strings in input datasets now default to being read in by `open_delim_dataset()` and its derivates as NAs and not empty string

* Closes: apache#37813

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

Add quoted_na param

d336163

thisisnic added the Breaking Change label Sep 22, 2023

github-actions bot added the Component: R and awaiting committer review labels Sep 22, 2023

Add tests for quoted_na

ba6d55b

thisisnic changed the title ~~GH-37813: [R] add quoted_na argument to open_delim_dataset() [WIP]~~ GH-37813: [R] add quoted_na argument to open_delim_dataset() Sep 22, 2023

thisisnic marked this pull request as ready for review September 22, 2023 12:41

thisisnic requested a review from paleolimbot as a code owner September 22, 2023 12:41

thisisnic added 2 commits September 22, 2023 13:44

Empty-Commit

ba273a6

Update tests

07d47b4

thisisnic marked this pull request as draft September 24, 2023 17:13

Update docs

0cf0353

thisisnic marked this pull request as ready for review September 25, 2023 08:27

Linter appeasing

0ad8705

paleolimbot approved these changes Sep 27, 2023

github-actions bot added the awaiting merge and awaiting changes labels and removed the awaiting committer review and awaiting merge labels Sep 27, 2023

Simplify test

2d775f8

github-actions bot added the awaiting change review and awaiting changes labels and removed the awaiting changes and awaiting change review labels Sep 28, 2023

All hail the mighty linter

8a83ce2

github-actions bot added the awaiting change review label and removed the awaiting changes label Sep 29, 2023

thisisnic merged commit 72c6497 into apache:main Sep 29, 2023
10 checks passed

thisisnic removed the awaiting change review label Sep 29, 2023
