Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-6231: [C++] Allow generating CSV column names #5206

Closed

Conversation

pitrou
Copy link
Member

@pitrou pitrou commented Aug 27, 2019

Add an option that, if enabled, will autogenerate column names
for the target table rather than read them from the CSV file.

Add an option that, if enabled, will autogenerate column names
for the target table rather than read them from the CSV file.
@pitrou
Copy link
Member Author

pitrou commented Aug 27, 2019

@nealrichardson @jorisvandenbossche thoughts?

@nealrichardson
Copy link
Contributor

If true, column names will be of the form "f0", "f1"...

Why "f"? FTR, here's how it looks in R:

> cat("1,2,3\n4,5,6\n", file="test.csv")
> read.csv("test.csv", header=FALSE)
  V1 V2 V3
1  1  2  3
2  4  5  6
> readr::read_csv("test.csv", col_names=FALSE)
Parsed with column specification:
cols(
  X1 = col_double(),
  X2 = col_double(),
  X3 = col_double()
)
# A tibble: 2 x 3
     X1    X2    X3
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6

@pitrou
Copy link
Member Author

pitrou commented Aug 27, 2019

"f" for field.

@wesm
Copy link
Member

wesm commented Aug 27, 2019

Does this patch address the usability problem with header_rows=0? I think it might make more sense to assign names by default in such case than to raise an exception.

@pitrou
Copy link
Member Author

pitrou commented Aug 27, 2019

header_rows doesn't exist anymore. Instead we have an unrelated skip_rows.

@nealrichardson
Copy link
Contributor

Yeah, the "usability problem" that came up on Stack Overflow was already addressed, just after 0.14 was released.

@wesm
Copy link
Member

wesm commented Aug 27, 2019

Okay, so in R we have this code

read.csv("test.csv", header=FALSE)

and in pandas

In [14]: buf = io.StringIO('1,2,3\n4,5,6')                                                                                                                                                     

In [15]: pandas.read_csv(buf, header=None)                                                                                                                                                     
Out[15]: 
   0  1  2
0  1  2  3
1  4  5  6

Do we want to have a header option here or only use the autogenerate option?

@pitrou
Copy link
Member Author

pitrou commented Aug 27, 2019

What kind of "header option" are you looking for? If the user wants to change the autogenerated column names, they can probably do it on the table.

@pitrou
Copy link
Member Author

pitrou commented Aug 27, 2019

Or are we talking about API ergonomics? Feel free to propose a different API, as I'm not sure what it would look like :-)

@jorisvandenbossche
Copy link
Member

As far as I understand, a header option would be exactly the same as what you implemented, but just a different name?
header=True would indicate the file contains a header and the column names are read from the first line (after skip_rows), and header=False would mean that the file does not contain a header and the column names are auto-generated.

Given the usage of "header" in R (and in pandas, although with a slightly different meaning) is maybe a reason to use that instead of autogenerate_column_names. Although autogenerate_column_names is more explicit.

@pitrou
Copy link
Member Author

pitrou commented Aug 27, 2019

I prefer the more explicit name personally.

@nealrichardson
Copy link
Contributor

I don't have a strong opinion about the argument name. I expect that the R, Python, and whatever other libraries would expose an interface that matches the expectations of their users, and internally map the arguments to whatever they're called in the C++ library.

I do think the "f"-for-field prefix is unexpected, but 🤷‍♂

@pitrou
Copy link
Member Author

pitrou commented Aug 27, 2019

I do think the "f"-for-field prefix is unexpected, but

What would be more expected, according to you?

@nealrichardson
Copy link
Contributor

Well, in the R examples I pasted, the prefix character is V (for variable) or X, so that's what I expect, just based on what I know. I won't deny that that's arbitrary.

"f" doesn't resonate for me because I don't think of the columns in the table as "fields", though I understand where you're taking that from. Looking at ARROW-4511, "field" doesn't show up in the terminology section. And "record batch" later is explained as "a collection of independent arrays each having the same length as one another but potentially different data types." So "a" for array might be a better choice, or "v" for vector since the doc says that array and vector are interchangeable (and "v" is a variable prefix used elsewhere).

I'm curious what others think.

@codecov-io
Copy link

Codecov Report

Merging #5206 into master will decrease coverage by 22.93%.
The diff coverage is 100%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #5206       +/-   ##
===========================================
- Coverage   88.71%   65.77%   -22.94%     
===========================================
  Files         934      554      -380     
  Lines      121320    71118    -50202     
  Branches     1437        0     -1437     
===========================================
- Hits       107627    46778    -60849     
- Misses      13331    24340    +11009     
+ Partials      362        0      -362
Impacted Files Coverage Δ
cpp/src/arrow/csv/options.h 100% <ø> (ø) ⬆️
cpp/src/arrow/csv/reader.cc 94.82% <100%> (+0.23%) ⬆️
python/pyarrow/tests/test_csv.py 99.1% <100%> (+0.04%) ⬆️
python/pyarrow/_csv.pyx 99.28% <100%> (+0.02%) ⬆️
cpp/src/arrow/util/memory.h 0% <0%> (-100%) ⬇️
cpp/src/gandiva/date_utils.h 0% <0%> (-100%) ⬇️
cpp/src/arrow/util/memory.cc 0% <0%> (-100%) ⬇️
cpp/src/gandiva/decimal_type_util.h 0% <0%> (-100%) ⬇️
cpp/src/arrow/filesystem/util_internal.cc 0% <0%> (-100%) ⬇️
cpp/src/arrow/compute/logical_type.h 0% <0%> (-100%) ⬇️
... and 630 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cef82b4...cc7fdc6. Read the comment docs.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Aug 28, 2019

Personally, I wouldnt use "v", as we don't use the term vector in the C++ project. I would also not use "a" as a column/field is more than just the array. So I would rather go for "f" (field) or "c" (column) (with a preference for "c" as column is a more familiar term), or for something generic without meaning like "x".

@bkietz
Copy link
Member

bkietz commented Aug 28, 2019

If there's this much disagreement, perhaps the cromulent solution is to provide autogenerate_column_names_prefix (default "f")

@pitrou
Copy link
Member Author

pitrou commented Aug 28, 2019

"cromulent" simply should be the hardcoded prefix.

@pitrou
Copy link
Member Author

pitrou commented Aug 29, 2019

Does anyone else have a strong feeling here? Like @jorisvandenbossche I think it should either be "f", "c" or "x". No personal preference.

@fsaintjacques
Copy link
Contributor

I vote for 'c', then 'f'.

@fsaintjacques
Copy link
Contributor

I also think that autogenerate_column_names_prefix would close this debate and let everyone be happy.

@pitrou
Copy link
Member Author

pitrou commented Aug 30, 2019

Creating dubious preferences to make everyone happy is a UI antipattern, though.
See https://ometer.com/preferences.html

In this case anyone can rename the columns after the fact if they want to. The primary goal here is to avoid failing.

@fsaintjacques
Copy link
Contributor

Then keep the 'f' as-is, and let's move on :). I can merge now if you're fine with it.

@pitrou
Copy link
Member Author

pitrou commented Aug 30, 2019

I'm fine with it personally :-)

@pitrou pitrou deleted the ARROW-6231-autogenerate-column-names branch August 30, 2019 16:57
kou pushed a commit that referenced this pull request Oct 17, 2019
…tion in Arrow 0.15

After reviewing the commentary on #5206 , and examining the new read options, I've concluded that supporting a boolean `:headers` option at this level is not a good idea.  In particular, all the interpretations of `headers: false` that I could think of would be surprising to a typical Ruby CSV programmer (does it mean `:autogenerate_column_names`? are you required to add `:column_names` instead?).

So I've just made the exception clearer.  It would be great if there were a way to pull the [documentation of the options](https://github.com/apache/arrow/blob/02d1e9736808d9a9624bef5577c880d8c165e853/python/pyarrow/_csv.pyx#L42) over from the `.py` files into the Ruby doc.

Closes #5609 from cobbr2/bug/ARROW-6813-ruby-headers-option-fail and squashes the following commits:

c941b10 <Rick Cobb> Retry the docker-compose build now that I see it succeed for anybody
0d67b65 <Rick Cobb> Retry the docker-compose build, can't find button in Github UX
0134d5e <Rick Cobb> simplify headers handling
2dedfad <Rick Cobb> and now with a headers: string test
7655138 <Rick Cobb> Merge work that makes headers work more like CSV
1215bfe <Rick Cobb> truthy headers should work like CSV
c7fe77f <Rick Cobb> Make `headers:` compatible with Ruby's CSV.new
a4934b6 <Rick Cobb> Should never have committed Gemfile.lock
4af0111 <Rick Cobb> Make a better error message than NoMethodError
0341652 <Rick Cobb> WIP: is this the correct test... and will skip_rows work?

Authored-by: Rick Cobb <rick@grandrounds.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
kszucs pushed a commit to kszucs/arrow that referenced this pull request Oct 21, 2019
…tion in Arrow 0.15

After reviewing the commentary on apache#5206 , and examining the new read options, I've concluded that supporting a boolean `:headers` option at this level is not a good idea.  In particular, all the interpretations of `headers: false` that I could think of would be surprising to a typical Ruby CSV programmer (does it mean `:autogenerate_column_names`? are you required to add `:column_names` instead?).

So I've just made the exception clearer.  It would be great if there were a way to pull the [documentation of the options](https://github.com/apache/arrow/blob/02d1e9736808d9a9624bef5577c880d8c165e853/python/pyarrow/_csv.pyx#L42) over from the `.py` files into the Ruby doc.

Closes apache#5609 from cobbr2/bug/ARROW-6813-ruby-headers-option-fail and squashes the following commits:

c941b10 <Rick Cobb> Retry the docker-compose build now that I see it succeed for anybody
0d67b65 <Rick Cobb> Retry the docker-compose build, can't find button in Github UX
0134d5e <Rick Cobb> simplify headers handling
2dedfad <Rick Cobb> and now with a headers: string test
7655138 <Rick Cobb> Merge work that makes headers work more like CSV
1215bfe <Rick Cobb> truthy headers should work like CSV
c7fe77f <Rick Cobb> Make `headers:` compatible with Ruby's CSV.new
a4934b6 <Rick Cobb> Should never have committed Gemfile.lock
4af0111 <Rick Cobb> Make a better error message than NoMethodError
0341652 <Rick Cobb> WIP: is this the correct test... and will skip_rows work?

Authored-by: Rick Cobb <rick@grandrounds.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
kszucs pushed a commit to kszucs/arrow that referenced this pull request Oct 21, 2019
…tion in Arrow 0.15

After reviewing the commentary on apache#5206 , and examining the new read options, I've concluded that supporting a boolean `:headers` option at this level is not a good idea.  In particular, all the interpretations of `headers: false` that I could think of would be surprising to a typical Ruby CSV programmer (does it mean `:autogenerate_column_names`? are you required to add `:column_names` instead?).

So I've just made the exception clearer.  It would be great if there were a way to pull the [documentation of the options](https://github.com/apache/arrow/blob/02d1e9736808d9a9624bef5577c880d8c165e853/python/pyarrow/_csv.pyx#L42) over from the `.py` files into the Ruby doc.

Closes apache#5609 from cobbr2/bug/ARROW-6813-ruby-headers-option-fail and squashes the following commits:

c941b10 <Rick Cobb> Retry the docker-compose build now that I see it succeed for anybody
0d67b65 <Rick Cobb> Retry the docker-compose build, can't find button in Github UX
0134d5e <Rick Cobb> simplify headers handling
2dedfad <Rick Cobb> and now with a headers: string test
7655138 <Rick Cobb> Merge work that makes headers work more like CSV
1215bfe <Rick Cobb> truthy headers should work like CSV
c7fe77f <Rick Cobb> Make `headers:` compatible with Ruby's CSV.new
a4934b6 <Rick Cobb> Should never have committed Gemfile.lock
4af0111 <Rick Cobb> Make a better error message than NoMethodError
0341652 <Rick Cobb> WIP: is this the correct test... and will skip_rows work?

Authored-by: Rick Cobb <rick@grandrounds.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
kszucs pushed a commit to kszucs/arrow that referenced this pull request Oct 21, 2019
…tion in Arrow 0.15

After reviewing the commentary on apache#5206 , and examining the new read options, I've concluded that supporting a boolean `:headers` option at this level is not a good idea.  In particular, all the interpretations of `headers: false` that I could think of would be surprising to a typical Ruby CSV programmer (does it mean `:autogenerate_column_names`? are you required to add `:column_names` instead?).

So I've just made the exception clearer.  It would be great if there were a way to pull the [documentation of the options](https://github.com/apache/arrow/blob/02d1e9736808d9a9624bef5577c880d8c165e853/python/pyarrow/_csv.pyx#L42) over from the `.py` files into the Ruby doc.

Closes apache#5609 from cobbr2/bug/ARROW-6813-ruby-headers-option-fail and squashes the following commits:

c941b10 <Rick Cobb> Retry the docker-compose build now that I see it succeed for anybody
0d67b65 <Rick Cobb> Retry the docker-compose build, can't find button in Github UX
0134d5e <Rick Cobb> simplify headers handling
2dedfad <Rick Cobb> and now with a headers: string test
7655138 <Rick Cobb> Merge work that makes headers work more like CSV
1215bfe <Rick Cobb> truthy headers should work like CSV
c7fe77f <Rick Cobb> Make `headers:` compatible with Ruby's CSV.new
a4934b6 <Rick Cobb> Should never have committed Gemfile.lock
4af0111 <Rick Cobb> Make a better error message than NoMethodError
0341652 <Rick Cobb> WIP: is this the correct test... and will skip_rows work?

Authored-by: Rick Cobb <rick@grandrounds.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants