-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-6813: [Ruby] Arrow::Table.load with headers=true leads to exception in Arrow 0.15 #5609
ARROW-6813: [Ruby] Arrow::Table.load with headers=true leads to exception in Arrow 0.15 #5609
Conversation
@kou : How do I move this forward? This is my first PR for Arrow, so I don't know what's next in the process. |
Someone should be by to code review, so that will be the next step |
The |
How about this? diff --git a/ruby/red-arrow/lib/arrow/csv-loader.rb b/ruby/red-arrow/lib/arrow/csv-loader.rb
index c63ab89eb..43a5b39ad 100644
--- a/ruby/red-arrow/lib/arrow/csv-loader.rb
+++ b/ruby/red-arrow/lib/arrow/csv-loader.rb
@@ -94,9 +94,9 @@ module Arrow
case key
when :headers
if value
- options.n_header_rows = 1
+ options.generate_column_names = false
else
- options.n_header_rows = 0
+ options.generate_column_names = true
end
when :column_types
value.each do |name, type|
diff --git a/ruby/red-arrow/test/test-csv-loader.rb b/ruby/red-arrow/test/test-csv-loader.rb
index 96de9c863..093712083 100644
--- a/ruby/red-arrow/test/test-csv-loader.rb
+++ b/ruby/red-arrow/test/test-csv-loader.rb
@@ -121,6 +121,29 @@ class CSVLoaderTest < Test::Unit::TestCase
Arrow::CSVLoader.load(data, options)
end
+ sub_test_case(":headers") do
+ test("true") do
+ values = Arrow::StringArray.new(["a", "b", "c"])
+ assert_equal(Arrow::Table.new(value: values),
+ load_csv(<<-CSV, headers: true))
+value
+a
+b
+c
+ CSV
+ end
+
+ test("false") do
+ values = Arrow::StringArray.new(["a", "b", "c"])
+ assert_equal(Arrow::Table.new(f0: values),
+ load_csv(<<-CSV, headers: false))
+a
+b
+c
+ CSV
+ end
+ end
+
test(":column_types") do
assert_equal(Arrow::Table.new(:count => Arrow::UInt16Array.new([1, 2, 4])),
load_csv(<<-CSV, column_types: {count: :uint16})) |
I'll give that a run in 18 hours or so, I think it's OK, but want to run a test. If the philosophy is to try to match Ruby's CSV.new, there are a couple of additional situations to support (all of which are possible, but more work). From https://ruby-doc.org/stdlib-2.6.3/libdoc/csv/rdoc/CSV.html#method-c-new:
I see easily how to do the Array argument; is there any reason to do String, or will we always revert to Ruby's CSV in that case? |
First: you're right, and thank you for the feedback. I think that interpreting String or Array values as if they were |
@kou : Should be ready for your review again now. I'm not sure whether you'd prefer to collapse the truthy / true / :first_line cases or not; I kept them separate because they are documented separately (and in the case of truthy and falsey, not at all) in the CSV documentation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:first_line
is a historical value. We can treat it as truthy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Thanks!
…tion in Arrow 0.15 After reviewing the commentary on apache#5206 , and examining the new read options, I've concluded that supporting a boolean `:headers` option at this level is not a good idea. In particular, all the interpretations of `headers: false` that I could think of would be surprising to a typical Ruby CSV programmer (does it mean `:autogenerate_column_names`? are you required to add `:column_names` instead?). So I've just made the exception clearer. It would be great if there were a way to pull the [documentation of the options](https://github.com/apache/arrow/blob/02d1e9736808d9a9624bef5577c880d8c165e853/python/pyarrow/_csv.pyx#L42) over from the `.py` files into the Ruby doc. Closes apache#5609 from cobbr2/bug/ARROW-6813-ruby-headers-option-fail and squashes the following commits: c941b10 <Rick Cobb> Retry the docker-compose build now that I see it succeed for anybody 0d67b65 <Rick Cobb> Retry the docker-compose build, can't find button in Github UX 0134d5e <Rick Cobb> simplify headers handling 2dedfad <Rick Cobb> and now with a headers: string test 7655138 <Rick Cobb> Merge work that makes headers work more like CSV 1215bfe <Rick Cobb> truthy headers should work like CSV c7fe77f <Rick Cobb> Make `headers:` compatible with Ruby's CSV.new a4934b6 <Rick Cobb> Should never have committed Gemfile.lock 4af0111 <Rick Cobb> Make a better error message than NoMethodError 0341652 <Rick Cobb> WIP: is this the correct test... and will skip_rows work? Authored-by: Rick Cobb <rick@grandrounds.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
…tion in Arrow 0.15 After reviewing the commentary on apache#5206 , and examining the new read options, I've concluded that supporting a boolean `:headers` option at this level is not a good idea. In particular, all the interpretations of `headers: false` that I could think of would be surprising to a typical Ruby CSV programmer (does it mean `:autogenerate_column_names`? are you required to add `:column_names` instead?). So I've just made the exception clearer. It would be great if there were a way to pull the [documentation of the options](https://github.com/apache/arrow/blob/02d1e9736808d9a9624bef5577c880d8c165e853/python/pyarrow/_csv.pyx#L42) over from the `.py` files into the Ruby doc. Closes apache#5609 from cobbr2/bug/ARROW-6813-ruby-headers-option-fail and squashes the following commits: c941b10 <Rick Cobb> Retry the docker-compose build now that I see it succeed for anybody 0d67b65 <Rick Cobb> Retry the docker-compose build, can't find button in Github UX 0134d5e <Rick Cobb> simplify headers handling 2dedfad <Rick Cobb> and now with a headers: string test 7655138 <Rick Cobb> Merge work that makes headers work more like CSV 1215bfe <Rick Cobb> truthy headers should work like CSV c7fe77f <Rick Cobb> Make `headers:` compatible with Ruby's CSV.new a4934b6 <Rick Cobb> Should never have committed Gemfile.lock 4af0111 <Rick Cobb> Make a better error message than NoMethodError 0341652 <Rick Cobb> WIP: is this the correct test... and will skip_rows work? Authored-by: Rick Cobb <rick@grandrounds.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
…tion in Arrow 0.15 After reviewing the commentary on apache#5206 , and examining the new read options, I've concluded that supporting a boolean `:headers` option at this level is not a good idea. In particular, all the interpretations of `headers: false` that I could think of would be surprising to a typical Ruby CSV programmer (does it mean `:autogenerate_column_names`? are you required to add `:column_names` instead?). So I've just made the exception clearer. It would be great if there were a way to pull the [documentation of the options](https://github.com/apache/arrow/blob/02d1e9736808d9a9624bef5577c880d8c165e853/python/pyarrow/_csv.pyx#L42) over from the `.py` files into the Ruby doc. Closes apache#5609 from cobbr2/bug/ARROW-6813-ruby-headers-option-fail and squashes the following commits: c941b10 <Rick Cobb> Retry the docker-compose build now that I see it succeed for anybody 0d67b65 <Rick Cobb> Retry the docker-compose build, can't find button in Github UX 0134d5e <Rick Cobb> simplify headers handling 2dedfad <Rick Cobb> and now with a headers: string test 7655138 <Rick Cobb> Merge work that makes headers work more like CSV 1215bfe <Rick Cobb> truthy headers should work like CSV c7fe77f <Rick Cobb> Make `headers:` compatible with Ruby's CSV.new a4934b6 <Rick Cobb> Should never have committed Gemfile.lock 4af0111 <Rick Cobb> Make a better error message than NoMethodError 0341652 <Rick Cobb> WIP: is this the correct test... and will skip_rows work? Authored-by: Rick Cobb <rick@grandrounds.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
After reviewing the commentary on #5206 , and examining the new read options, I've concluded that supporting a boolean
:headers
option at this level is not a good idea. In particular, all the interpretations ofheaders: false
that I could think of would be surprising to a typical Ruby CSV programmer (does it mean:autogenerate_column_names
? are you required to add:column_names
instead?).So I've just made the exception clearer. It would be great if there were a way to pull the documentation of the options over from the
.py
files into the Ruby doc.