ARROW-6813: [Ruby] Arrow::Table.load with headers=true leads to exception in Arrow 0.15 #5609

cobbr2 · 2019-10-08T23:46:53Z

After reviewing the commentary on #5206 , and examining the new read options, I've concluded that supporting a boolean :headers option at this level is not a good idea. In particular, all the interpretations of headers: false that I could think of would be surprising to a typical Ruby CSV programmer (does it mean :autogenerate_column_names? are you required to add :column_names instead?).

So I've just made the exception clearer. It would be great if there were a way to pull the documentation of the options over from the .py files into the Ruby doc.

github-actions · 2019-10-09T02:46:20Z

https://issues.apache.org/jira/browse/ARROW-6813

cobbr2 · 2019-10-09T16:59:11Z

@kou : How do I move this forward? This is my first PR for Arrow, so I don't know what's next in the process.

wesm · 2019-10-09T20:19:37Z

Someone should be by to code review, so that will be the next step

kou · 2019-10-09T21:15:06Z

The :headers option exists for compatibility of the Ruby's csv library: https://github.com/ruby/csv
The CSV loader in Red Arrow uses Arrow C++'s CSV reader if we can use.
Otherwise, The CSV loader falls back to the Ruby's csv library. If we raise an exception for the :headers option, we can use the :headers option with the Ruby's csv library.
So we should not remove the option.

kou · 2019-10-09T21:24:20Z

How about this?

diff --git a/ruby/red-arrow/lib/arrow/csv-loader.rb b/ruby/red-arrow/lib/arrow/csv-loader.rb
index c63ab89eb..43a5b39ad 100644
--- a/ruby/red-arrow/lib/arrow/csv-loader.rb
+++ b/ruby/red-arrow/lib/arrow/csv-loader.rb
@@ -94,9 +94,9 @@ module Arrow
         case key
         when :headers
           if value
-            options.n_header_rows = 1
+            options.generate_column_names = false
           else
-            options.n_header_rows = 0
+            options.generate_column_names = true
           end
         when :column_types
           value.each do |name, type|
diff --git a/ruby/red-arrow/test/test-csv-loader.rb b/ruby/red-arrow/test/test-csv-loader.rb
index 96de9c863..093712083 100644
--- a/ruby/red-arrow/test/test-csv-loader.rb
+++ b/ruby/red-arrow/test/test-csv-loader.rb
@@ -121,6 +121,29 @@ class CSVLoaderTest < Test::Unit::TestCase
       Arrow::CSVLoader.load(data, options)
     end
 
+    sub_test_case(":headers") do
+      test("true") do
+        values = Arrow::StringArray.new(["a", "b", "c"])
+        assert_equal(Arrow::Table.new(value: values),
+                     load_csv(<<-CSV, headers: true))
+value
+a
+b
+c
+                     CSV
+      end
+
+      test("false") do
+        values = Arrow::StringArray.new(["a", "b", "c"])
+        assert_equal(Arrow::Table.new(f0: values),
+                     load_csv(<<-CSV, headers: false))
+a
+b
+c
+                     CSV
+      end
+    end
+
     test(":column_types") do
       assert_equal(Arrow::Table.new(:count => Arrow::UInt16Array.new([1, 2, 4])),
                    load_csv(<<-CSV, column_types: {count: :uint16}))

cobbr2 · 2019-10-11T01:14:02Z

I'll give that a run in 18 hours or so, I think it's OK, but want to run a test.

If the philosophy is to try to match Ruby's CSV.new, there are a couple of additional situations to support (all of which are possible, but more work). From https://ruby-doc.org/stdlib-2.6.3/libdoc/csv/rdoc/CSV.html#method-c-new:

If set to :first_row or true, the initial row of the CSV file will be treated as a row of headers. If set to an Array, the contents will be used as the headers. If set to a String, the String is run through a call of ::parse_line with the same :col_sep, :row_sep, and :quote_char as this instance to produce an Array of headers....

I see easily how to do the Array argument; is there any reason to do String, or will we always revert to Ruby's CSV in that case?

cobbr2 · 2019-10-11T08:20:23Z

First: you're right, and thank you for the feedback. I think that interpreting String or Array values as if they were true, however, is incorrect, so I took the liberty to add interpretations of them, too. At the moment, it's missing a test case for headers: "some, csv, string", but that's as far as I can take the work tonight.

cobbr2 · 2019-10-11T20:04:06Z

@kou : Should be ready for your review again now. I'm not sure whether you'd prefer to collapse the truthy / true / :first_line cases or not; I kept them separate because they are documented separately (and in the case of truthy and falsey, not at all) in the CSV documentation.

kou

:first_line is a historical value. We can treat it as truthy.

ruby/red-arrow/lib/arrow/csv-loader.rb

kou

+1
Thanks!

…tion in Arrow 0.15 After reviewing the commentary on apache#5206 , and examining the new read options, I've concluded that supporting a boolean `:headers` option at this level is not a good idea. In particular, all the interpretations of `headers: false` that I could think of would be surprising to a typical Ruby CSV programmer (does it mean `:autogenerate_column_names`? are you required to add `:column_names` instead?). So I've just made the exception clearer. It would be great if there were a way to pull the [documentation of the options](https://github.com/apache/arrow/blob/02d1e9736808d9a9624bef5577c880d8c165e853/python/pyarrow/_csv.pyx#L42) over from the `.py` files into the Ruby doc. Closes apache#5609 from cobbr2/bug/ARROW-6813-ruby-headers-option-fail and squashes the following commits: c941b10 <Rick Cobb> Retry the docker-compose build now that I see it succeed for anybody 0d67b65 <Rick Cobb> Retry the docker-compose build, can't find button in Github UX 0134d5e <Rick Cobb> simplify headers handling 2dedfad <Rick Cobb> and now with a headers: string test 7655138 <Rick Cobb> Merge work that makes headers work more like CSV 1215bfe <Rick Cobb> truthy headers should work like CSV c7fe77f <Rick Cobb> Make `headers:` compatible with Ruby's CSV.new a4934b6 <Rick Cobb> Should never have committed Gemfile.lock 4af0111 <Rick Cobb> Make a better error message than NoMethodError 0341652 <Rick Cobb> WIP: is this the correct test... and will skip_rows work? Authored-by: Rick Cobb <rick@grandrounds.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>

cobbr2 added 3 commits October 8, 2019 12:20

WIP: is this the correct test... and will skip_rows work?

0341652

Make a better error message than NoMethodError

4af0111

Should never have committed Gemfile.lock

a4934b6

cobbr2 marked this pull request as ready for review October 9, 2019 01:42

kou changed the title ~~[ARROW-6813] [RUBY] Arrow::Table.load with headers=true leads to exception in Arrow 0.15~~ ARROW-6813: [Ruby] Arrow::Table.load with headers=true leads to exception in Arrow 0.15 Oct 9, 2019

cobbr2 added 3 commits October 11, 2019 00:56

Make headers: compatible with Ruby's CSV.new

c7fe77f

truthy headers should work like CSV

1215bfe

Merge work that makes headers work more like CSV

7655138

and now with a headers: string test

2dedfad

kou reviewed Oct 11, 2019

View reviewed changes

ruby/red-arrow/lib/arrow/csv-loader.rb Outdated Show resolved Hide resolved

ruby/red-arrow/lib/arrow/csv-loader.rb Outdated Show resolved Hide resolved

cobbr2 added 3 commits October 15, 2019 10:11

simplify headers handling

0134d5e

Retry the docker-compose build, can't find button in Github UX

0d67b65

Retry the docker-compose build now that I see it succeed for anybody

c941b10

kou approved these changes Oct 17, 2019

View reviewed changes

kou closed this in a81db80 Oct 17, 2019

asfimport mentioned this pull request Oct 18, 2019

[Ruby] Arrow::Table.load with headers=true leads to exception in Arrow 0.15 #23146

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ARROW-6813: [Ruby] Arrow::Table.load with headers=true leads to exception in Arrow 0.15 #5609

ARROW-6813: [Ruby] Arrow::Table.load with headers=true leads to exception in Arrow 0.15 #5609

cobbr2 commented Oct 8, 2019 •

edited

Loading

github-actions bot commented Oct 9, 2019

cobbr2 commented Oct 9, 2019

wesm commented Oct 9, 2019

kou commented Oct 9, 2019

kou commented Oct 9, 2019

cobbr2 commented Oct 11, 2019

cobbr2 commented Oct 11, 2019

cobbr2 commented Oct 11, 2019

kou left a comment

kou left a comment

ARROW-6813: [Ruby] Arrow::Table.load with headers=true leads to exception in Arrow 0.15 #5609

ARROW-6813: [Ruby] Arrow::Table.load with headers=true leads to exception in Arrow 0.15 #5609

Conversation

cobbr2 commented Oct 8, 2019 • edited Loading

github-actions bot commented Oct 9, 2019

cobbr2 commented Oct 9, 2019

wesm commented Oct 9, 2019

kou commented Oct 9, 2019

kou commented Oct 9, 2019

cobbr2 commented Oct 11, 2019

cobbr2 commented Oct 11, 2019

cobbr2 commented Oct 11, 2019

kou left a comment

Choose a reason for hiding this comment

kou left a comment

Choose a reason for hiding this comment

cobbr2 commented Oct 8, 2019 •

edited

Loading