ARROW-15123: [R] CSV dataset file header read in as data #12152

thisisnic · 2022-01-14T08:32:05Z

No description provided.

github-actions · 2022-01-14T08:32:25Z

https://issues.apache.org/jira/browse/ARROW-15123

github-actions · 2022-01-14T08:32:27Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

paleolimbot

Just some tiny comments!

r/R/util.R

r/tests/testthat/test-dataset-csv.R

r/R/query-engine.R

jonkeane

This is great, thank you! A few comments:

We have the following in the read_csv_arrow() docs, should we have something in the dataset docs for that as well? Or a link back to these?

#' Note that if you are specifying column names, whether by `schema` or
#' `col_names`, and the CSV file has a header row that would otherwise be used
#' to identify column names, you'll need to add `skip = 1` to skip that row.

This is well beyond the scope of this ticket, but it got me thinking: we wouldn't be able to support someone reading in a set of CSVs where only one file has headers ("the first" one in their conception, though I admit we (purposefully!) don't read files in a consistent order). It might be worth mentioning in our docs (or the cookbook) that if you're going to cut up a file using head/tail/awk from one giant csv to many smaller ones, you're best off dropping the headers too (or including them in every new file).
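For readers following along, here is a minimal sketch of how the `skip = 1` note quoted from the read_csv_arrow() docs might carry over to a CSV dataset. It assumes an arrow version in which `open_dataset(format = "csv")` accepts the readr-style `skip` argument (the Arrow C++ option name is `skip_rows`); the file name, column names, and temporary directory are made up for illustration.

```r
library(arrow)
library(dplyr)

# Hypothetical example data: one CSV file with a header row.
csv_dir <- tempfile()
dir.create(csv_dir)
write.csv(
  data.frame(int = 1:3, dbl = c(1.5, 2.5, 3.5)),
  file.path(csv_dir, "part-1.csv"),
  row.names = FALSE
)

# Because a schema is supplied, the header row is not needed for column
# names, so it is skipped explicitly rather than being parsed as data.
# Assumption: this arrow version accepts the readr-style `skip` argument here.
ds <- open_dataset(
  csv_dir,
  format = "csv",
  schema = schema(int = int32(), dbl = float64()),
  skip = 1
)

result <- collect(ds)
```

Without the `skip` argument, the header line would be parsed against the `int32()`/`float64()` schema, which is the situation this PR aims to report with a friendlier error.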

thisisnic · 2022-01-14T15:38:53Z

@jonkeane Yeah, I think the documentation could be expanded generally. That should be covered in PR #12083, and this topic should absolutely be covered in the cookbook.

nealrichardson · 2022-01-14T16:44:04Z

> if you're going to cut up a file using head/tail/awk from one giant csv to many smaller ones

We should actually advise against that entirely: use open_dataset() %>% write_dataset() instead and partition meaningfully.

jonkeane · 2022-01-14T16:46:13Z

> We should actually advise against that entirely: use open_dataset() %>% write_dataset() instead and partition meaningfully.

Yes, absolutely. I meant more: if someone has found themselves in that situation we should warn about that (and suggest this as a better alternative)
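A rough sketch of the `open_dataset() %>% write_dataset()` approach suggested above, for anyone landing on this thread. The input file, the `year` partition column, and the output directory are hypothetical; the output uses `write_dataset()`'s default Parquet format.

```r
library(arrow)
library(dplyr)

# Hypothetical "one giant CSV" (tiny here, for illustration only).
src <- tempfile(fileext = ".csv")
write.csv(
  data.frame(year = c(2020L, 2020L, 2021L), value = c(1.5, 2.5, 3.5)),
  src,
  row.names = FALSE
)

# Let Arrow split the data by a meaningful column instead of cutting
# the file by line count with head/tail/awk.
out_dir <- tempfile()
open_dataset(src, format = "csv") %>%
  write_dataset(out_dir, partitioning = "year")

# Each partition becomes its own directory, e.g. "year=2020".
partitions <- sort(list.files(out_dir))

# The partitioned output can be reopened as a single dataset.
rewritten <- collect(open_dataset(out_dir))
```

Because Arrow writes every output file itself, the question of which file carries the header row does not arise.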

thisisnic · 2022-01-24T11:38:39Z

The cookbook has this ticket open for adding something on doing open_dataset()...write_dataset(): apache/arrow-cookbook#130.

Any more changes needed @jonkeane or can this be merged?

ursabot · 2022-01-26T14:20:55Z

Benchmark runs are scheduled for baseline = 0b95b62 and contender = 4582713. 4582713 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.26% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

github-actions bot added the Component: R label Jan 14, 2022

thisisnic changed the title ~~ARROW-15123: [R] Schema order not respected and file header ignored~~ ARROW-15123: [R] CSV dataset file header read in as data Jan 14, 2022

thisisnic marked this pull request as draft January 14, 2022 08:51

thisisnic closed this Jan 14, 2022

thisisnic reopened this Jan 14, 2022

Add user-friendly error message and tests

78f2e61

thisisnic force-pushed the ARROW-15123_file_header branch from fe8cff6 to 78f2e61 Compare January 14, 2022 09:22

thisisnic added 2 commits January 14, 2022 10:00

Fix assignment

229d781

Fix weird interaction between rlang::abort and bad assignment

7f77adf

thisisnic marked this pull request as ready for review January 14, 2022 10:46

paleolimbot reviewed Jan 14, 2022

r/R/util.R (outdated, resolved)

r/tests/testthat/test-dataset-csv.R (outdated, resolved)

nealrichardson reviewed Jan 14, 2022

r/R/query-engine.R (outdated, resolved)

Relocate the tryCatch, fix test formatting

b63174b

jonkeane reviewed Jan 14, 2022

Use abort

1482931

thisisnic requested a review from jonkeane January 24, 2022 11:39

jonkeane closed this in 4582713 Jan 26, 2022
