ARROW-6231: [C++] Allow generating CSV column names #5206

pitrou · 2019-08-27T13:42:32Z

Add an option that, if enabled, will autogenerate column names
for the target table rather than read them from the CSV file.

Add an option that, if enabled, will autogenerate column names for the target table rather than read them from the CSV file.

pitrou · 2019-08-27T13:44:02Z

@nealrichardson @jorisvandenbossche thoughts?

nealrichardson · 2019-08-27T14:51:13Z

If true, column names will be of the form "f0", "f1"...

Why "f"? FTR, here's how it looks in R:

> cat("1,2,3\n4,5,6\n", file="test.csv")
> read.csv("test.csv", header=FALSE)
  V1 V2 V3
1  1  2  3
2  4  5  6
> readr::read_csv("test.csv", col_names=FALSE)
Parsed with column specification:
cols(
  X1 = col_double(),
  X2 = col_double(),
  X3 = col_double()
)
# A tibble: 2 x 3
     X1    X2    X3
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6

pitrou · 2019-08-27T14:54:35Z

"f" for field.

wesm · 2019-08-27T15:37:20Z

Does this patch address the usability problem with header_rows=0? I think it might make more sense to assign names by default in such case than to raise an exception.

pitrou · 2019-08-27T15:38:23Z

header_rows doesn't exist anymore. Instead we have an unrelated skip_rows.

nealrichardson · 2019-08-27T15:49:26Z

Yeah, the "usability problem" that came up on Stack Overflow was already addressed, just after 0.14 was released.

wesm · 2019-08-27T15:56:46Z

Okay, so in R we have this code

read.csv("test.csv", header=FALSE)

and in pandas

In [14]: buf = io.StringIO('1,2,3\n4,5,6')                                                                                                                                                     

In [15]: pandas.read_csv(buf, header=None)                                                                                                                                                     
Out[15]: 
   0  1  2
0  1  2  3
1  4  5  6

Do we want to have a header option here or only use the autogenerate option?

pitrou · 2019-08-27T16:00:41Z

What kind of "header option" are you looking for? If the user wants to change the autogenerated column names, they can probably do it on the table.

pitrou · 2019-08-27T16:01:14Z

Or are we talking about API ergonomics? Feel free to propose a different API, as I'm not sure what it would look like :-)

jorisvandenbossche · 2019-08-27T16:06:12Z

As far as I understand, a header option would be exactly the same as what you implemented, but just a different name?
header=True would indicate the file contains a header and the column names are read from the first line (after skip_rows), and header=False would mean that the file does not contain a header and the column names are auto-generated.

Given the usage of "header" in R (and in pandas, although with a slightly different meaning) is maybe a reason to use that instead of autogenerate_column_names. Although autogenerate_column_names is more explicit.

pitrou · 2019-08-27T16:26:02Z

I prefer the more explicit name personally.

nealrichardson · 2019-08-27T17:54:01Z

I don't have a strong opinion about the argument name. I expect that the R, Python, and whatever other libraries would expose an interface that matches the expectations of their users, and internally map the arguments to whatever they're called in the C++ library.

I do think the "f"-for-field prefix is unexpected, but 🤷‍♂

pitrou · 2019-08-27T18:43:55Z

I do think the "f"-for-field prefix is unexpected, but

What would be more expected, according to you?

nealrichardson · 2019-08-27T19:24:58Z

Well, in the R examples I pasted, the prefix character is V (for variable) or X, so that's what I expect, just based on what I know. I won't deny that that's arbitrary.

"f" doesn't resonate for me because I don't think of the columns in the table as "fields", though I understand where you're taking that from. Looking at ARROW-4511, "field" doesn't show up in the terminology section. And "record batch" later is explained as "a collection of independent arrays each having the same length as one another but potentially different data types." So "a" for array might be a better choice, or "v" for vector since the doc says that array and vector are interchangeable (and "v" is a variable prefix used elsewhere).

I'm curious what others think.

codecov-io · 2019-08-27T19:52:00Z

Codecov Report

Merging #5206 into master will decrease coverage by 22.93%.
The diff coverage is 100%.

@@             Coverage Diff             @@
##           master    #5206       +/-   ##
===========================================
- Coverage   88.71%   65.77%   -22.94%     
===========================================
  Files         934      554      -380     
  Lines      121320    71118    -50202     
  Branches     1437        0     -1437     
===========================================
- Hits       107627    46778    -60849     
- Misses      13331    24340    +11009     
+ Partials      362        0      -362

Impacted Files	Coverage Δ
cpp/src/arrow/csv/options.h	`100% <ø> (ø)`	⬆️
cpp/src/arrow/csv/reader.cc	`94.82% <100%> (+0.23%)`	⬆️
python/pyarrow/tests/test_csv.py	`99.1% <100%> (+0.04%)`	⬆️
python/pyarrow/_csv.pyx	`99.28% <100%> (+0.02%)`	⬆️
cpp/src/arrow/util/memory.h	`0% <0%> (-100%)`	⬇️
cpp/src/gandiva/date_utils.h	`0% <0%> (-100%)`	⬇️
cpp/src/arrow/util/memory.cc	`0% <0%> (-100%)`	⬇️
cpp/src/gandiva/decimal_type_util.h	`0% <0%> (-100%)`	⬇️
cpp/src/arrow/filesystem/util_internal.cc	`0% <0%> (-100%)`	⬇️
cpp/src/arrow/compute/logical_type.h	`0% <0%> (-100%)`	⬇️
... and 630 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cef82b4...cc7fdc6. Read the comment docs.

jorisvandenbossche · 2019-08-28T07:47:55Z

Personally, I wouldnt use "v", as we don't use the term vector in the C++ project. I would also not use "a" as a column/field is more than just the array. So I would rather go for "f" (field) or "c" (column) (with a preference for "c" as column is a more familiar term), or for something generic without meaning like "x".

bkietz · 2019-08-28T14:34:21Z

If there's this much disagreement, perhaps the cromulent solution is to provide autogenerate_column_names_prefix (default "f")

pitrou · 2019-08-28T14:40:36Z

"cromulent" simply should be the hardcoded prefix.

pitrou · 2019-08-29T12:14:50Z

Does anyone else have a strong feeling here? Like @jorisvandenbossche I think it should either be "f", "c" or "x". No personal preference.

fsaintjacques · 2019-08-30T14:28:58Z

I vote for 'c', then 'f'.

fsaintjacques · 2019-08-30T14:30:03Z

I also think that autogenerate_column_names_prefix would close this debate and let everyone be happy.

pitrou · 2019-08-30T14:44:44Z

Creating dubious preferences to make everyone happy is a UI antipattern, though.
See https://ometer.com/preferences.html

In this case anyone can rename the columns after the fact if they want to. The primary goal here is to avoid failing.

fsaintjacques · 2019-08-30T14:46:20Z

Then keep the 'f' as-is, and let's move on :). I can merge now if you're fine with it.

pitrou · 2019-08-30T14:51:35Z

I'm fine with it personally :-)

…tion in Arrow 0.15 After reviewing the commentary on #5206 , and examining the new read options, I've concluded that supporting a boolean `:headers` option at this level is not a good idea. In particular, all the interpretations of `headers: false` that I could think of would be surprising to a typical Ruby CSV programmer (does it mean `:autogenerate_column_names`? are you required to add `:column_names` instead?). So I've just made the exception clearer. It would be great if there were a way to pull the [documentation of the options](https://github.com/apache/arrow/blob/02d1e9736808d9a9624bef5577c880d8c165e853/python/pyarrow/_csv.pyx#L42) over from the `.py` files into the Ruby doc. Closes #5609 from cobbr2/bug/ARROW-6813-ruby-headers-option-fail and squashes the following commits: c941b10 <Rick Cobb> Retry the docker-compose build now that I see it succeed for anybody 0d67b65 <Rick Cobb> Retry the docker-compose build, can't find button in Github UX 0134d5e <Rick Cobb> simplify headers handling 2dedfad <Rick Cobb> and now with a headers: string test 7655138 <Rick Cobb> Merge work that makes headers work more like CSV 1215bfe <Rick Cobb> truthy headers should work like CSV c7fe77f <Rick Cobb> Make `headers:` compatible with Ruby's CSV.new a4934b6 <Rick Cobb> Should never have committed Gemfile.lock 4af0111 <Rick Cobb> Make a better error message than NoMethodError 0341652 <Rick Cobb> WIP: is this the correct test... and will skip_rows work? Authored-by: Rick Cobb <rick@grandrounds.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>

…tion in Arrow 0.15 After reviewing the commentary on apache#5206 , and examining the new read options, I've concluded that supporting a boolean `:headers` option at this level is not a good idea. In particular, all the interpretations of `headers: false` that I could think of would be surprising to a typical Ruby CSV programmer (does it mean `:autogenerate_column_names`? are you required to add `:column_names` instead?). So I've just made the exception clearer. It would be great if there were a way to pull the [documentation of the options](https://github.com/apache/arrow/blob/02d1e9736808d9a9624bef5577c880d8c165e853/python/pyarrow/_csv.pyx#L42) over from the `.py` files into the Ruby doc. Closes apache#5609 from cobbr2/bug/ARROW-6813-ruby-headers-option-fail and squashes the following commits: c941b10 <Rick Cobb> Retry the docker-compose build now that I see it succeed for anybody 0d67b65 <Rick Cobb> Retry the docker-compose build, can't find button in Github UX 0134d5e <Rick Cobb> simplify headers handling 2dedfad <Rick Cobb> and now with a headers: string test 7655138 <Rick Cobb> Merge work that makes headers work more like CSV 1215bfe <Rick Cobb> truthy headers should work like CSV c7fe77f <Rick Cobb> Make `headers:` compatible with Ruby's CSV.new a4934b6 <Rick Cobb> Should never have committed Gemfile.lock 4af0111 <Rick Cobb> Make a better error message than NoMethodError 0341652 <Rick Cobb> WIP: is this the correct test... and will skip_rows work? Authored-by: Rick Cobb <rick@grandrounds.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>

ARROW-6231: [C++] Allow generating CSV column names

cc7fdc6

Add an option that, if enabled, will autogenerate column names for the target table rather than read them from the CSV file.

fsaintjacques closed this in beea8f9 Aug 30, 2019

pitrou deleted the ARROW-6231-autogenerate-column-names branch August 30, 2019 16:57

cobbr2 mentioned this pull request Oct 8, 2019

ARROW-6813: [Ruby] Arrow::Table.load with headers=true leads to exception in Arrow 0.15 #5609

Closed

asfimport mentioned this pull request Sep 2, 2019

[C++][Python] Consider assigning default column names when reading CSV file and header_rows=0 #22618

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ARROW-6231: [C++] Allow generating CSV column names #5206

ARROW-6231: [C++] Allow generating CSV column names #5206

pitrou commented Aug 27, 2019

pitrou commented Aug 27, 2019

nealrichardson commented Aug 27, 2019

pitrou commented Aug 27, 2019

wesm commented Aug 27, 2019

pitrou commented Aug 27, 2019

nealrichardson commented Aug 27, 2019

wesm commented Aug 27, 2019

pitrou commented Aug 27, 2019

pitrou commented Aug 27, 2019 •

edited

jorisvandenbossche commented Aug 27, 2019

pitrou commented Aug 27, 2019

nealrichardson commented Aug 27, 2019

pitrou commented Aug 27, 2019

nealrichardson commented Aug 27, 2019

codecov-io commented Aug 27, 2019

jorisvandenbossche commented Aug 28, 2019 •

edited

bkietz commented Aug 28, 2019

pitrou commented Aug 28, 2019

pitrou commented Aug 29, 2019

fsaintjacques commented Aug 30, 2019

fsaintjacques commented Aug 30, 2019

pitrou commented Aug 30, 2019

fsaintjacques commented Aug 30, 2019

pitrou commented Aug 30, 2019

ARROW-6231: [C++] Allow generating CSV column names #5206

ARROW-6231: [C++] Allow generating CSV column names #5206

Conversation

pitrou commented Aug 27, 2019

pitrou commented Aug 27, 2019

nealrichardson commented Aug 27, 2019

pitrou commented Aug 27, 2019

wesm commented Aug 27, 2019

pitrou commented Aug 27, 2019

nealrichardson commented Aug 27, 2019

wesm commented Aug 27, 2019

pitrou commented Aug 27, 2019

pitrou commented Aug 27, 2019 • edited

jorisvandenbossche commented Aug 27, 2019

pitrou commented Aug 27, 2019

nealrichardson commented Aug 27, 2019

pitrou commented Aug 27, 2019

nealrichardson commented Aug 27, 2019

codecov-io commented Aug 27, 2019

Codecov Report

jorisvandenbossche commented Aug 28, 2019 • edited

bkietz commented Aug 28, 2019

pitrou commented Aug 28, 2019

pitrou commented Aug 29, 2019

fsaintjacques commented Aug 30, 2019

fsaintjacques commented Aug 30, 2019

pitrou commented Aug 30, 2019

fsaintjacques commented Aug 30, 2019

pitrou commented Aug 30, 2019

pitrou commented Aug 27, 2019 •

edited

jorisvandenbossche commented Aug 28, 2019 •

edited