Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-6813: [Ruby] Arrow::Table.load with headers=true leads to exception in Arrow 0.15 #5609

Closed
wants to merge 10 commits into from
Closed

ARROW-6813: [Ruby] Arrow::Table.load with headers=true leads to exception in Arrow 0.15 #5609

wants to merge 10 commits into from

Conversation

cobbr2
Copy link
Contributor

@cobbr2 cobbr2 commented Oct 8, 2019

After reviewing the commentary on #5206 , and examining the new read options, I've concluded that supporting a boolean :headers option at this level is not a good idea. In particular, all the interpretations of headers: false that I could think of would be surprising to a typical Ruby CSV programmer (does it mean :autogenerate_column_names? are you required to add :column_names instead?).

So I've just made the exception clearer. It would be great if there were a way to pull the documentation of the options over from the .py files into the Ruby doc.

@cobbr2 cobbr2 marked this pull request as ready for review October 9, 2019 01:42
@kou kou changed the title [ARROW-6813] [RUBY] Arrow::Table.load with headers=true leads to exception in Arrow 0.15 ARROW-6813: [Ruby] Arrow::Table.load with headers=true leads to exception in Arrow 0.15 Oct 9, 2019
@github-actions
Copy link

github-actions bot commented Oct 9, 2019

@cobbr2
Copy link
Contributor Author

cobbr2 commented Oct 9, 2019

@kou : How do I move this forward? This is my first PR for Arrow, so I don't know what's next in the process.

@wesm
Copy link
Member

wesm commented Oct 9, 2019

Someone should be by to code review, so that will be the next step

@kou
Copy link
Member

kou commented Oct 9, 2019

The :headers option exists for compatibility of the Ruby's csv library: https://github.com/ruby/csv
The CSV loader in Red Arrow uses Arrow C++'s CSV reader if we can use.
Otherwise, The CSV loader falls back to the Ruby's csv library. If we raise an exception for the :headers option, we can use the :headers option with the Ruby's csv library.
So we should not remove the option.

@kou
Copy link
Member

kou commented Oct 9, 2019

How about this?

diff --git a/ruby/red-arrow/lib/arrow/csv-loader.rb b/ruby/red-arrow/lib/arrow/csv-loader.rb
index c63ab89eb..43a5b39ad 100644
--- a/ruby/red-arrow/lib/arrow/csv-loader.rb
+++ b/ruby/red-arrow/lib/arrow/csv-loader.rb
@@ -94,9 +94,9 @@ module Arrow
         case key
         when :headers
           if value
-            options.n_header_rows = 1
+            options.generate_column_names = false
           else
-            options.n_header_rows = 0
+            options.generate_column_names = true
           end
         when :column_types
           value.each do |name, type|
diff --git a/ruby/red-arrow/test/test-csv-loader.rb b/ruby/red-arrow/test/test-csv-loader.rb
index 96de9c863..093712083 100644
--- a/ruby/red-arrow/test/test-csv-loader.rb
+++ b/ruby/red-arrow/test/test-csv-loader.rb
@@ -121,6 +121,29 @@ class CSVLoaderTest < Test::Unit::TestCase
       Arrow::CSVLoader.load(data, options)
     end
 
+    sub_test_case(":headers") do
+      test("true") do
+        values = Arrow::StringArray.new(["a", "b", "c"])
+        assert_equal(Arrow::Table.new(value: values),
+                     load_csv(<<-CSV, headers: true))
+value
+a
+b
+c
+                     CSV
+      end
+
+      test("false") do
+        values = Arrow::StringArray.new(["a", "b", "c"])
+        assert_equal(Arrow::Table.new(f0: values),
+                     load_csv(<<-CSV, headers: false))
+a
+b
+c
+                     CSV
+      end
+    end
+
     test(":column_types") do
       assert_equal(Arrow::Table.new(:count => Arrow::UInt16Array.new([1, 2, 4])),
                    load_csv(<<-CSV, column_types: {count: :uint16}))

@cobbr2
Copy link
Contributor Author

cobbr2 commented Oct 11, 2019

I'll give that a run in 18 hours or so, I think it's OK, but want to run a test.

If the philosophy is to try to match Ruby's CSV.new, there are a couple of additional situations to support (all of which are possible, but more work). From https://ruby-doc.org/stdlib-2.6.3/libdoc/csv/rdoc/CSV.html#method-c-new:

If set to :first_row or true, the initial row of the CSV file will be treated as a row of headers. If set to an Array, the contents will be used as the headers. If set to a String, the String is run through a call of ::parse_line with the same :col_sep, :row_sep, and :quote_char as this instance to produce an Array of headers....

I see easily how to do the Array argument; is there any reason to do String, or will we always revert to Ruby's CSV in that case?

@cobbr2
Copy link
Contributor Author

cobbr2 commented Oct 11, 2019

First: you're right, and thank you for the feedback. I think that interpreting String or Array values as if they were true, however, is incorrect, so I took the liberty to add interpretations of them, too. At the moment, it's missing a test case for headers: "some, csv, string", but that's as far as I can take the work tonight.

@cobbr2
Copy link
Contributor Author

cobbr2 commented Oct 11, 2019

@kou : Should be ready for your review again now. I'm not sure whether you'd prefer to collapse the truthy / true / :first_line cases or not; I kept them separate because they are documented separately (and in the case of truthy and falsey, not at all) in the CSV documentation.

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:first_line is a historical value. We can treat it as truthy.

ruby/red-arrow/lib/arrow/csv-loader.rb Outdated Show resolved Hide resolved
ruby/red-arrow/lib/arrow/csv-loader.rb Outdated Show resolved Hide resolved
Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1
Thanks!

@kou kou closed this in a81db80 Oct 17, 2019
kszucs pushed a commit to kszucs/arrow that referenced this pull request Oct 21, 2019
…tion in Arrow 0.15

After reviewing the commentary on apache#5206 , and examining the new read options, I've concluded that supporting a boolean `:headers` option at this level is not a good idea.  In particular, all the interpretations of `headers: false` that I could think of would be surprising to a typical Ruby CSV programmer (does it mean `:autogenerate_column_names`? are you required to add `:column_names` instead?).

So I've just made the exception clearer.  It would be great if there were a way to pull the [documentation of the options](https://github.com/apache/arrow/blob/02d1e9736808d9a9624bef5577c880d8c165e853/python/pyarrow/_csv.pyx#L42) over from the `.py` files into the Ruby doc.

Closes apache#5609 from cobbr2/bug/ARROW-6813-ruby-headers-option-fail and squashes the following commits:

c941b10 <Rick Cobb> Retry the docker-compose build now that I see it succeed for anybody
0d67b65 <Rick Cobb> Retry the docker-compose build, can't find button in Github UX
0134d5e <Rick Cobb> simplify headers handling
2dedfad <Rick Cobb> and now with a headers: string test
7655138 <Rick Cobb> Merge work that makes headers work more like CSV
1215bfe <Rick Cobb> truthy headers should work like CSV
c7fe77f <Rick Cobb> Make `headers:` compatible with Ruby's CSV.new
a4934b6 <Rick Cobb> Should never have committed Gemfile.lock
4af0111 <Rick Cobb> Make a better error message than NoMethodError
0341652 <Rick Cobb> WIP: is this the correct test... and will skip_rows work?

Authored-by: Rick Cobb <rick@grandrounds.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
kszucs pushed a commit to kszucs/arrow that referenced this pull request Oct 21, 2019
…tion in Arrow 0.15

After reviewing the commentary on apache#5206 , and examining the new read options, I've concluded that supporting a boolean `:headers` option at this level is not a good idea.  In particular, all the interpretations of `headers: false` that I could think of would be surprising to a typical Ruby CSV programmer (does it mean `:autogenerate_column_names`? are you required to add `:column_names` instead?).

So I've just made the exception clearer.  It would be great if there were a way to pull the [documentation of the options](https://github.com/apache/arrow/blob/02d1e9736808d9a9624bef5577c880d8c165e853/python/pyarrow/_csv.pyx#L42) over from the `.py` files into the Ruby doc.

Closes apache#5609 from cobbr2/bug/ARROW-6813-ruby-headers-option-fail and squashes the following commits:

c941b10 <Rick Cobb> Retry the docker-compose build now that I see it succeed for anybody
0d67b65 <Rick Cobb> Retry the docker-compose build, can't find button in Github UX
0134d5e <Rick Cobb> simplify headers handling
2dedfad <Rick Cobb> and now with a headers: string test
7655138 <Rick Cobb> Merge work that makes headers work more like CSV
1215bfe <Rick Cobb> truthy headers should work like CSV
c7fe77f <Rick Cobb> Make `headers:` compatible with Ruby's CSV.new
a4934b6 <Rick Cobb> Should never have committed Gemfile.lock
4af0111 <Rick Cobb> Make a better error message than NoMethodError
0341652 <Rick Cobb> WIP: is this the correct test... and will skip_rows work?

Authored-by: Rick Cobb <rick@grandrounds.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
kszucs pushed a commit to kszucs/arrow that referenced this pull request Oct 21, 2019
…tion in Arrow 0.15

After reviewing the commentary on apache#5206 , and examining the new read options, I've concluded that supporting a boolean `:headers` option at this level is not a good idea.  In particular, all the interpretations of `headers: false` that I could think of would be surprising to a typical Ruby CSV programmer (does it mean `:autogenerate_column_names`? are you required to add `:column_names` instead?).

So I've just made the exception clearer.  It would be great if there were a way to pull the [documentation of the options](https://github.com/apache/arrow/blob/02d1e9736808d9a9624bef5577c880d8c165e853/python/pyarrow/_csv.pyx#L42) over from the `.py` files into the Ruby doc.

Closes apache#5609 from cobbr2/bug/ARROW-6813-ruby-headers-option-fail and squashes the following commits:

c941b10 <Rick Cobb> Retry the docker-compose build now that I see it succeed for anybody
0d67b65 <Rick Cobb> Retry the docker-compose build, can't find button in Github UX
0134d5e <Rick Cobb> simplify headers handling
2dedfad <Rick Cobb> and now with a headers: string test
7655138 <Rick Cobb> Merge work that makes headers work more like CSV
1215bfe <Rick Cobb> truthy headers should work like CSV
c7fe77f <Rick Cobb> Make `headers:` compatible with Ruby's CSV.new
a4934b6 <Rick Cobb> Should never have committed Gemfile.lock
4af0111 <Rick Cobb> Make a better error message than NoMethodError
0341652 <Rick Cobb> WIP: is this the correct test... and will skip_rows work?

Authored-by: Rick Cobb <rick@grandrounds.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants