Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41749: [GLib] Allow getting a RecordBatchReader from a Dataset or Scanner #41750

Merged
merged 9 commits into from
May 25, 2024

Conversation

adamreeve
Copy link
Contributor

@adamreeve adamreeve commented May 21, 2024

Rationale for this change

See #41749

What changes are included in this PR?

Adds to_reader methods to GADatasetDataset and GADatasetScanner.

Are these changes tested?

Yes I've added new unit tests.

Are there any user-facing changes?

Yes this is a new feature.

@adamreeve adamreeve requested a review from kou as a code owner May 21, 2024 07:43
Copy link

⚠️ GitHub issue #41749 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added awaiting review Awaiting review awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels May 21, 2024
@adamreeve adamreeve changed the title GH-41749: [GLib] Allow getting a RecordBatchReader from a Dataset or Dataset Scanner GH-41749: [GLib] Allow getting a RecordBatchReader from a Dataset or Scanner May 21, 2024
c_glib/test/dataset/test-file-system-dataset.rb Outdated Show resolved Hide resolved
c_glib/test/dataset/test-scanner.rb Outdated Show resolved Hide resolved
c_glib/arrow-dataset-glib/dataset.cpp Outdated Show resolved Hide resolved
c_glib/arrow-dataset-glib/dataset.cpp Outdated Show resolved Hide resolved
c_glib/arrow-dataset-glib/dataset.cpp Outdated Show resolved Hide resolved
c_glib/arrow-dataset-glib/scanner.cpp Outdated Show resolved Hide resolved
c_glib/arrow-dataset-glib/dataset.cpp Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels May 21, 2024
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels May 21, 2024
@adamreeve
Copy link
Contributor Author

adamreeve commented May 21, 2024

I'm not sure how I've broken the tests on MinGW, I'll need to figure out what's going wrong there.
These are failing with an error like:

Error: test_record_batch_reader(TestDatasetFileSystemDataset): Errno::ENOTEMPTY: Directory not empty @ dir_s_rmdir - D:/a/_temp/d20240521-4260-zvpgmi

I guess something is keeping a file open and preventing the directory contents from being deleted.

@kou
Copy link
Member

kou commented May 22, 2024

Could you try this?

diff --git a/c_glib/test/dataset/test-file-system-dataset.rb b/c_glib/test/dataset/test-file-system-dataset.rb
index a64022b618..d4bf567397 100644
--- a/c_glib/test/dataset/test-file-system-dataset.rb
+++ b/c_glib/test/dataset/test-file-system-dataset.rb
@@ -63,7 +63,11 @@ class TestDatasetFileSystemDataset < Test::Unit::TestCase
   def test_record_batch_reader
     dataset, expected_table = create_dataset
     reader = dataset.to_record_batch_reader
-    assert_equal(expected_table, reader.read_all)
+    begin
+      assert_equal(expected_table, reader.read_all)
+    ensure
+      reader.close
+    end
   end
 
   def create_dataset
diff --git a/c_glib/test/dataset/test-scanner.rb b/c_glib/test/dataset/test-scanner.rb
index 714760d6a3..1a50b7fe68 100644
--- a/c_glib/test/dataset/test-scanner.rb
+++ b/c_glib/test/dataset/test-scanner.rb
@@ -47,6 +47,11 @@ class TestDatasetScanner < Test::Unit::TestCase
   end
 
   def test_record_batch_reader
-    assert_equal(@table, @scanner.to_record_batch_reader.read_all)
+    reader = @scanner.to_record_batch_reader
+    begin
+      assert_equal(@table, reader.read_all)
+    ensure
+      reader.close
+    end
   end
 end

@adamreeve
Copy link
Contributor Author

adamreeve commented May 22, 2024

Could you try this?

That doesn't work because there aren't bindings for the Close method, but even if there were, it looks like this isn't sufficient to close the underlying files. I think this is a bug in the C++ implementation, I've opened #41771 for this. Maybe this is behaving as expected though.

I guess for now I could manually cleanup the temporary directory and use the force option ignore errors.

Adding a GC.start also seems to work but I don't think that would be a reliable workaround.

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work?

diff --git a/c_glib/test/dataset/test-file-system-dataset.rb b/c_glib/test/dataset/test-file-system-dataset.rb
index ac67c6d61d..972c2d2ef9 100644
--- a/c_glib/test/dataset/test-file-system-dataset.rb
+++ b/c_glib/test/dataset/test-file-system-dataset.rb
@@ -21,8 +21,7 @@ class TestDatasetFileSystemDataset < Test::Unit::TestCase
 
   def setup
     omit("Arrow Dataset is required") unless defined?(ArrowDataset)
-    tmpdir = Dir.mktmpdir
-    begin
+    Dir.mktmpdir do |tmpdir|
       @dir = tmpdir
       @format = ArrowDataset::IPCFileFormat.new
       @factory = ArrowDataset::FileSystemDatasetFactory.new(@format)
@@ -33,14 +32,6 @@ class TestDatasetFileSystemDataset < Test::Unit::TestCase
         ArrowDataset::DirectoryPartitioning.new(partitioning_schema)
       @factory.partitioning = @partitioning
       yield
-    ensure
-      # We have to ignore errors trying to remove the directory due to
-      # the RecordBatchReader not closing files
-      # (https://github.com/apache/arrow/issues/41771).
-      # Also request GC first which should free any remaining RecordBatchReader
-      # and close open files.
-      GC.start
-      FileUtils.remove_entry(tmpdir, force: true)
     end
   end
 
@@ -72,7 +63,11 @@ class TestDatasetFileSystemDataset < Test::Unit::TestCase
   def test_record_batch_reader
     dataset, expected_table = create_dataset
     reader = dataset.to_record_batch_reader
-    assert_equal(expected_table, reader.read_all)
+    begin
+      assert_equal(expected_table, reader.read_all)
+    ensure
+      reader.unref
+    end
   end
 
   def create_dataset
diff --git a/c_glib/test/dataset/test-scanner.rb b/c_glib/test/dataset/test-scanner.rb
index ef834674cb..fe3478a866 100644
--- a/c_glib/test/dataset/test-scanner.rb
+++ b/c_glib/test/dataset/test-scanner.rb
@@ -21,8 +21,7 @@ class TestDatasetScanner < Test::Unit::TestCase
 
   def setup
     omit("Arrow Dataset is required") unless defined?(ArrowDataset)
-    tmpdir = Dir.mktmpdir
-    begin
+    Dir.mktmpdir do |tmpdir|
       path = File.join(tmpdir, "table.arrow")
       @table = build_table(visible: [
                              build_boolean_array([true, false, true]),
@@ -40,14 +39,6 @@ class TestDatasetScanner < Test::Unit::TestCase
       builder = @dataset.begin_scan
       @scanner = builder.finish
       yield
-    ensure
-      # We have to ignore errors trying to remove the directory due to
-      # the RecordBatchReader not closing files
-      # (https://github.com/apache/arrow/issues/41771).
-      # Also request GC first which should free any remaining RecordBatchReader
-      # and close open files.
-      GC.start
-      FileUtils.remove_entry(tmpdir, force: true)
     end
   end
 
@@ -56,6 +47,11 @@ class TestDatasetScanner < Test::Unit::TestCase
   end
 
   def test_record_batch_reader
-    assert_equal(@table, @scanner.to_record_batch_reader.read_all)
+    reader = @scanner.to_record_batch_reader
+    begin
+      assert_equal(@table, reader.read_all)
+    ensure
+      reader.unref
+    end
   end
 end

c_glib/arrow-dataset-glib/dataset.cpp Outdated Show resolved Hide resolved
c_glib/arrow-dataset-glib/dataset.cpp Outdated Show resolved Hide resolved
c_glib/arrow-dataset-glib/dataset.cpp Outdated Show resolved Hide resolved
c_glib/arrow-dataset-glib/dataset.cpp Outdated Show resolved Hide resolved
c_glib/arrow-dataset-glib/scanner.cpp Outdated Show resolved Hide resolved
c_glib/arrow-dataset-glib/scanner.cpp Outdated Show resolved Hide resolved
c_glib/test/dataset/test-scanner.rb Outdated Show resolved Hide resolved
c_glib/test/dataset/test-file-system-dataset.rb Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels May 25, 2024
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
@github-actions github-actions bot removed the awaiting changes Awaiting changes label May 25, 2024
@github-actions github-actions bot added the awaiting change review Awaiting change review label May 25, 2024
@adamreeve
Copy link
Contributor Author

Does this work?

Yes that works thanks, that's much better. Although it looks like the unref call may be unnecessary after #41824, so should I wait for that to be merged then remove the unref call?

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

We can use unref for now because #41824 may be rejected.

@kou kou merged commit 1c9e393 into apache:main May 25, 2024
10 checks passed
@kou kou removed the awaiting change review Awaiting change review label May 25, 2024
@github-actions github-actions bot added the awaiting merge Awaiting merge label May 25, 2024
@adamreeve adamreeve deleted the glib_dataset_reader branch May 25, 2024 21:46
Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 1c9e393.

There were 4 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 148 possible false positives for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants