ARROW-6231: [C++] Allow generating CSV column names #5206
Add an option that, if enabled, will autogenerate column names for the target table rather than read them from the CSV file.
@nealrichardson @jorisvandenbossche thoughts? |
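As a rough illustration of the behaviour proposed above (not the Arrow C++ implementation), here is a minimal standard-library Python sketch that treats every row of a CSV as data and generates the column names itself. The helper name `read_csv_autogenerated` is hypothetical, and the zero-based numbering after the prefix is an assumption of this sketch; the `"f"` prefix is the one discussed in this thread.

```python
import csv
import io


def read_csv_autogenerated(text, prefix="f"):
    """Parse header-less CSV text into a dict mapping column name -> values.

    Hypothetical helper for illustration only. Names are prefix + index;
    the zero-based index is an assumption of this sketch.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return {}
    num_columns = len(rows[0])
    # Generate one name per column instead of consuming the first row.
    names = [f"{prefix}{i}" for i in range(num_columns)]
    columns = {name: [] for name in names}
    for row in rows:
        if len(row) != num_columns:
            raise ValueError(f"expected {num_columns} fields, got {len(row)}")
        for name, value in zip(names, row):
            columns[name].append(value)
    return columns


# The first line is kept as data rather than read as a header.
# Values stay as strings; no type inference is attempted here.
table = read_csv_autogenerated("1,2,3\n4,5,6\n")
print(table)
```

The input is the same two-row file used in the R examples in this thread; the point is only that no row is lost to a header.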
Why "f"? FTR, here's how it looks in R:
> cat("1,2,3\n4,5,6\n", file="test.csv")
> read.csv("test.csv", header=FALSE)
  V1 V2 V3
1  1  2  3
2  4  5  6
> readr::read_csv("test.csv", col_names=FALSE)
Parsed with column specification:
cols(
  X1 = col_double(),
  X2 = col_double(),
  X3 = col_double()
)
# A tibble: 2 x 3
     X1    X2    X3
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6 |
"f" for field. |
Does this patch address the usability problem with |
Yeah, the "usability problem" that came up on Stack Overflow was already addressed, just after 0.14 was released. |
Okay, so in R we have this code
and in pandas
Do we want to have a header option here or only use the autogenerate option? |
What kind of "header option" are you looking for? If the user wants to change the autogenerated column names, they can probably do it on the table. |
Or are we talking about API ergonomics? Feel free to propose a different API, as I'm not sure what it would look like :-) |
As far as I understand, a Given the usage of "header" in R (and in pandas, although with a slightly different meaning) is maybe a reason to use that instead of |
I prefer the more explicit name personally. |
I don't have a strong opinion about the argument name. I expect that the R, Python, and whatever other libraries would expose an interface that matches the expectations of their users, and internally map the arguments to whatever they're called in the C++ library. I do think the "f"-for-field prefix is unexpected, but 🤷‍♂️ |
What would be more expected, according to you? |
Well, in the R examples I pasted, the prefix character is V (for variable) or X, so that's what I expect, just based on what I know. I won't deny that that's arbitrary. "f" doesn't resonate for me because I don't think of the columns in the table as "fields", though I understand where you're taking that from. Looking at ARROW-4511, "field" doesn't show up in the terminology section. And "record batch" later is explained as "a collection of independent arrays each having the same length as one another but potentially different data types." So "a" for array might be a better choice, or "v" for vector since the doc says that array and vector are interchangeable (and "v" is a variable prefix used elsewhere). I'm curious what others think. |
Codecov Report
@@             Coverage Diff              @@
##           master    #5206       +/-   ##
============================================
- Coverage   88.71%   65.77%   -22.94%
============================================
  Files         934      554      -380
  Lines      121320    71118    -50202
  Branches     1437        0     -1437
============================================
- Hits       107627    46778    -60849
- Misses      13331    24340    +11009
+ Partials      362        0      -362
|
Personally, I wouldn't use "v", as we don't use the term vector in the C++ project. I would also not use "a", as a column/field is more than just the array. So I would rather go for "f" (field) or "c" (column) (with a preference for "c", as column is a more familiar term), or for something generic without meaning like "x". |
If there's this much disagreement, perhaps the cromulent solution is to provide |
"cromulent" simply should be the hardcoded prefix. |
Does anyone else have a strong feeling here? Like @jorisvandenbossche I think it should either be "f", "c" or "x". No personal preference. |
I vote for 'c', then 'f'. |
I also think that |
Creating dubious preferences to make everyone happy is a UI antipattern, though. In this case anyone can rename the columns after the fact if they want to. The primary goal here is to avoid failing. |
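To make the "rename after the fact" point concrete, here is a small Python sketch using a plain dict of columns as a stand-in for a table. The helper `rename_columns` is hypothetical and written for illustration; it is not an Arrow API.

```python
def rename_columns(columns, new_names):
    """Return a new dict whose keys are replaced positionally by new_names.

    Hypothetical helper: `columns` maps name -> list of values and stands
    in for a table. The input dict is left unmodified.
    """
    if len(new_names) != len(columns):
        raise ValueError(
            f"expected {len(columns)} names, got {len(new_names)}"
        )
    # Dicts preserve insertion order, so zipping keeps column order intact.
    return dict(zip(new_names, columns.values()))


# Columns as they might come out of a reader with a fixed "f" prefix.
autogenerated = {"f0": [1, 4], "f1": [2, 5], "f2": [3, 6]}

# A caller who prefers R-style names can rename without a reader option.
renamed = rename_columns(autogenerated, ["V1", "V2", "V3"])
print(renamed)
```

With renaming this cheap on the caller's side, a hardcoded prefix costs users little, which is the argument made above against adding a prefix option.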
Then keep the 'f' as-is, and let's move on :). I can merge now if you're fine with it. |
I'm fine with it personally :-) |
…tion in Arrow 0.15

After reviewing the commentary on #5206, and examining the new read options, I've concluded that supporting a boolean `:headers` option at this level is not a good idea. In particular, all the interpretations of `headers: false` that I could think of would be surprising to a typical Ruby CSV programmer (does it mean `:autogenerate_column_names`? are you required to add `:column_names` instead?). So I've just made the exception clearer.

It would be great if there were a way to pull the [documentation of the options](https://github.com/apache/arrow/blob/02d1e9736808d9a9624bef5577c880d8c165e853/python/pyarrow/_csv.pyx#L42) over from the `.py` files into the Ruby doc.

Closes #5609 from cobbr2/bug/ARROW-6813-ruby-headers-option-fail and squashes the following commits:

c941b10 <Rick Cobb> Retry the docker-compose build now that I see it succeed for anybody
0d67b65 <Rick Cobb> Retry the docker-compose build, can't find button in Github UX
0134d5e <Rick Cobb> simplify headers handling
2dedfad <Rick Cobb> and now with a headers: string test
7655138 <Rick Cobb> Merge work that makes headers work more like CSV
1215bfe <Rick Cobb> truthy headers should work like CSV
c7fe77f <Rick Cobb> Make `headers:` compatible with Ruby's CSV.new
a4934b6 <Rick Cobb> Should never have committed Gemfile.lock
4af0111 <Rick Cobb> Make a better error message than NoMethodError
0341652 <Rick Cobb> WIP: is this the correct test... and will skip_rows work?

Authored-by: Rick Cobb <rick@grandrounds.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>