Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-37813: [R] add quoted_na argument to open_delim_dataset() #37828

Merged
merged 8 commits into from Sep 29, 2023

Conversation

thisisnic
Copy link
Member

@thisisnic thisisnic commented Sep 22, 2023

Rationale for this change

The open_delim_dataset() family of functions were implemented to have the same arguments as the read_delim_arrow() functions where possible, but quoted_na was missed.

What changes are included in this PR?

Adding quoted_na to those functions.

Are these changes tested?

Yes

Are there any user-facing changes?

Yes

This PR includes breaking changes to public APIs.

Empty strings in input datasets now default to being read in by open_delim_dataset() and its derivates as NAs and not empty string

@thisisnic thisisnic added the Breaking Change Includes a breaking change to the API label Sep 22, 2023
@thisisnic thisisnic changed the title GH-37813: [R] add quoted_na argument to open_delim_dataset() [WIP] GH-37813: [R] add quoted_na argument to open_delim_dataset() Sep 22, 2023
@thisisnic thisisnic marked this pull request as ready for review September 22, 2023 12:41
@thisisnic thisisnic marked this pull request as draft September 24, 2023 17:13
@thisisnic thisisnic marked this pull request as ready for review September 25, 2023 08:27
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly a few opportunities for clarification but in general looks good to me!

@@ -253,7 +253,7 @@ test_that("readr parse options", {
tsv_dir,
partitioning = "part",
format = "text",
quo = "\"",
del = ","
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change intentional?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep - because passing in a parameter as quo (not a valid parameter name) no longer constitutes a unique partial match to either the readr or arrow options, it is caught earlier on in input validation and so raises an "Unrecognized option" error instead of an "Ambiguous option" error. This input validation is something it would be nice to refactor in another PR, but I figured here we just want to replace quo with something else that takes us down the "Ambiguous option" path to stick with the spirit of the test.

Comment on lines 567 to 568
df <- data.frame(text = c("one", "two", "", "four"), num = 1:4)
write.csv(df, dst_file, row.names = FALSE, quote = FALSE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the same as write("one\ntwo\nthree\n\nfour") (It might be a tiny bit clearer what's actually getting tested to just write the contents of the file or put it in a comment...I forget what all the arguments of write.csv() actually do!)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, updated.

@github-actions github-actions bot added awaiting merge Awaiting merge awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review awaiting merge Awaiting merge labels Sep 27, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Sep 28, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Sep 29, 2023
@thisisnic thisisnic merged commit 72c6497 into apache:main Sep 29, 2023
10 checks passed
@thisisnic thisisnic removed the awaiting change review Awaiting change review label Sep 29, 2023
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 72c6497.

There were 5 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 23, 2023
…pache#37828)

### Rationale for this change

The `open_delim_dataset()` family of functions were implemented to have the same arguments as the `read_delim_arrow()` functions where possible, but `quoted_na` was missed.

### What changes are included in this PR?

Adding `quoted_na` to those functions.

### Are these changes tested?

Yes 

### Are there any user-facing changes?

Yes

**This PR includes breaking changes to public APIs.**

Empty strings in input datasets now default to being read in by `open_delim_dataset()` and its derivates as NAs and not empty string

* Closes: apache#37813

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 23, 2023
…pache#37828)

### Rationale for this change

The `open_delim_dataset()` family of functions were implemented to have the same arguments as the `read_delim_arrow()` functions where possible, but `quoted_na` was missed.

### What changes are included in this PR?

Adding `quoted_na` to those functions.

### Are these changes tested?

Yes 

### Are there any user-facing changes?

Yes

**This PR includes breaking changes to public APIs.**

Empty strings in input datasets now default to being read in by `open_delim_dataset()` and its derivates as NAs and not empty string

* Closes: apache#37813

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…pache#37828)

### Rationale for this change

The `open_delim_dataset()` family of functions were implemented to have the same arguments as the `read_delim_arrow()` functions where possible, but `quoted_na` was missed.

### What changes are included in this PR?

Adding `quoted_na` to those functions.

### Are these changes tested?

Yes 

### Are there any user-facing changes?

Yes

**This PR includes breaking changes to public APIs.**

Empty strings in input datasets now default to being read in by `open_delim_dataset()` and its derivates as NAs and not empty string

* Closes: apache#37813

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…pache#37828)

### Rationale for this change

The `open_delim_dataset()` family of functions were implemented to have the same arguments as the `read_delim_arrow()` functions where possible, but `quoted_na` was missed.

### What changes are included in this PR?

Adding `quoted_na` to those functions.

### Are these changes tested?

Yes 

### Are there any user-facing changes?

Yes

**This PR includes breaking changes to public APIs.**

Empty strings in input datasets now default to being read in by `open_delim_dataset()` and its derivates as NAs and not empty string

* Closes: apache#37813

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Breaking Change Includes a breaking change to the API Component: R
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[R] read_delim_arrow recognizes empty string "" as NA but open_dataset does not
2 participants