Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-15123: [R] CSV dataset file header read in as data #12152

Closed
wants to merge 5 commits into from

Conversation

thisisnic
Copy link
Member

No description provided.

@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@thisisnic thisisnic changed the title ARROW-15123: [R] Schema order not respected and file header ignored ARROW-15123: [R] CSV dataset file header read in as data Jan 14, 2022
@thisisnic thisisnic marked this pull request as draft January 14, 2022 08:51
@thisisnic thisisnic closed this Jan 14, 2022
@thisisnic thisisnic reopened this Jan 14, 2022
@thisisnic thisisnic marked this pull request as ready for review January 14, 2022 10:46
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some tiny comments!

r/R/util.R Outdated Show resolved Hide resolved
r/tests/testthat/test-dataset-csv.R Outdated Show resolved Hide resolved
r/R/query-engine.R Outdated Show resolved Hide resolved
Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thank you! A few comments:

We have the following in the read_csv_arrow() docs, should we have something in the dataset docs for that as well? Or a link back to these?

#' Note that if you are specifying column names, whether by `schema` or
#' `col_names`, and the CSV file has a header row that would otherwise be used
#' to idenfity column names, you'll need to add `skip = 1` to skip that row.

This is well beyond the scope of this ticket / jira, but this got me thinking about it: we wouldn't be able to support someone reading in a csv that has headers in one file ("the first" one in their conception, though I admit we don't (purposefully!) consistently read them in one order). This might be something worth mentioning in our docs (or the cookbook) that if you're going to cut up a file using head/tail/awk from one giant csv to many smaller ones, you're best off dropping the headers too (or including them in every new file)

@thisisnic
Copy link
Member Author

@jonkeane Yeah, I think the documentation could be expanded generally - should be covered in the PR here #12083 and this topic generally should absolutely be covered in the cookbook.

@nealrichardson
Copy link
Member

if you're going to cut up a file using head/tail/awk from one giant csv to many smaller ones

We should actually advise against that entirely: use open_dataset() %>% write_dataset() instead and partition meaningfully.

@jonkeane
Copy link
Member

We should actually advise against that entirely: use open_dataset() %>% write_dataset() instead and partition meaningfully.

Yes, absolutely. I meant more: if someone has found themselves in that situation we should warn about that (and suggest this as a better alternative)

@thisisnic
Copy link
Member Author

The cookbook has this ticket open for adding something on doing open_dataset()...write_dataset(): apache/arrow-cookbook#130.

Any more changes needed @jonkeane or can this be merged?

@ursabot
Copy link

ursabot commented Jan 26, 2022

Benchmark runs are scheduled for baseline = 0b95b62 and contender = 4582713. 4582713 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.26% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants