ARROW-15123: [R] CSV dataset file header read in as data #12152
fe8cff6 to 78f2e61
Just some tiny comments!
This is great, thank you! A few comments:
We have the following in the `read_csv_arrow()` docs. Should we have something in the dataset docs for that as well, or a link back to these?
#' Note that if you are specifying column names, whether by `schema` or
#' `col_names`, and the CSV file has a header row that would otherwise be used
#' to identify column names, you'll need to add `skip = 1` to skip that row.
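For reference, the behaviour that note describes can be reproduced with a minimal sketch like the one below. It assumes the arrow R package is installed; the temporary file and the column names `x` and `y` are made up for illustration and are not taken from this PR.

```r
library(arrow)

# Hypothetical example file: a header row followed by two data rows
tf <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,a", "2,b"), tf)

# Column names are supplied explicitly, so the header row is no longer
# consumed as names and is read in as an ordinary data row
with_header_as_data <- read_csv_arrow(tf, col_names = c("x", "y"))

# Adding skip = 1 drops the header row, leaving only the real data
header_skipped <- read_csv_arrow(tf, col_names = c("x", "y"), skip = 1)

nrow(with_header_as_data)
nrow(header_skipped)
```

The first read should come back one row longer than the second, with the header text sitting in the first row, which is the symptom this ticket is about.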
This is well beyond the scope of this ticket / Jira, but it got me thinking: we wouldn't be able to support someone reading in a CSV dataset that has headers in only one file ("the first" one in their conception, though I admit we (purposefully!) don't read files in a consistent order). This might be worth mentioning in our docs (or the cookbook): if you're going to cut one giant CSV into many smaller ones using head/tail/awk, you're best off dropping the headers too (or including them in every new file).
We should actually advise against that entirely: use
Yes, absolutely. I meant more: if someone has found themselves in that situation, we should warn about that (and suggest this as a better alternative).
The cookbook has this ticket open for adding something on doing
Any more changes needed @jonkeane, or can this be merged?
Benchmark runs are scheduled for baseline = 0b95b62 and contender = 4582713. 4582713 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.