
Should GetIOThreadPool() be accessible from installed headers? #15151

Closed
paleolimbot opened this issue Jan 2, 2023 · 7 comments · Fixed by #15183

Comments

@paleolimbot
Member

Describe the enhancement requested

In #14582 it was found that using the CPU thread pool in arrow::compute::MakeReaderGenerator() caused problems when the number of CPU threads was limited (as it often is on CI machines with few available cores). The solution was to use the IO thread pool for this; however, arrow::io::internal::GetIOThreadPool() is not available in any installed headers. I don't know what the best way to make this available would be (or whether creating a source node from a record batch reader should be baked into the internals somewhere); however, my hack of:

namespace arrow {
namespace io {
namespace internal {
arrow::internal::ThreadPool* GetIOThreadPool();
}
}  // namespace io
}  // namespace arrow

...in the R package should almost certainly not exist.

Component(s)

C++

@westonpace
Member

Yes, I think it's entirely appropriate to put RecordBatchReader->source node support in the C++ code. On the output side we have collector variants for tables (DeclarationToTable), vectors of record batches (DeclarationToBatches), and record batch readers (DeclarationToReader). We already have source node variants for accepting data from a table (table_source) and from a vector of record batches (record_batch_source). So I think record_batch_reader_source would be a good addition.
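For illustration only, here is a rough sketch of how such a record_batch_reader_source factory might be consumed. The RecordBatchReaderSourceNodeOptions struct, its constructor arguments, and the TableFromReader helper are assumptions based on this proposal, not an existing API at the time of this comment:

#include <memory>

#include "arrow/compute/exec/exec_plan.h"  // Declaration, DeclarationToTable
#include "arrow/compute/exec/options.h"
#include "arrow/record_batch.h"
#include "arrow/result.h"
#include "arrow/table.h"
#include "arrow/util/thread_pool.h"  // arrow::internal::Executor

namespace cp = arrow::compute;

// Hypothetical: wrap a RecordBatchReader in the proposed source factory and
// collect the plan's output as a Table, running the reader on io_executor.
arrow::Result<std::shared_ptr<arrow::Table>> TableFromReader(
    std::shared_ptr<arrow::RecordBatchReader> reader,
    arrow::internal::Executor* io_executor) {
  // Assumed options struct carrying the reader plus the executor that drives it.
  cp::RecordBatchReaderSourceNodeOptions source_options{std::move(reader), io_executor};
  cp::Declaration source{"record_batch_reader_source", std::move(source_options)};
  return cp::DeclarationToTable(std::move(source));
}

The appeal over the MakeReaderGenerator() route is that the caller never has to name the I/O thread pool type directly; any Executor* (for example the one obtained from default_io_context()) would do.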

However, I'm also not sure why we wouldn't expose the default I/O pool (arrow::io::internal::GetIOThreadPool) in the public headers given that we have public methods for getting and setting the size.
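(For reference, the public methods for getting and setting the size are, as far as I know, the free functions arrow::io::GetIOThreadPoolCapacity() and arrow::io::SetIOThreadPoolCapacity() declared in arrow/io/interfaces.h; a minimal sketch, with the GrowIOPool helper name made up here:)

#include "arrow/io/interfaces.h"
#include "arrow/status.h"

// Read the current capacity of the default I/O thread pool and grow it.
// GrowIOPool is an illustrative helper name, not part of Arrow.
arrow::Status GrowIOPool(int extra_threads) {
  int current = arrow::io::GetIOThreadPoolCapacity();
  return arrow::io::SetIOThreadPoolCapacity(current + extra_threads);
}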

As a short term hack you can do:

#include "arrow/io/type_fwd.h"

// Grab the Executor backing the default I/O context (i.e. the I/O thread pool).
arrow::io::IOContext io_context = arrow::io::default_io_context();
arrow::internal::Executor* io_executor = io_context.executor();

@paleolimbot
Member Author

As a short term hack you can do:

We already do this in the R package where we need the IO thread pool to submit jobs; the problem here is that we need a ThreadPool* to pass to MakeReaderGenerator().

I think it would be a good addition to add record_batch_reader_source.

That would be my preferred solution. I'd rather not maintain that logic in the R package, and the same need has come up on the mailing list in a context unrelated to the R package ( https://lists.apache.org/thread/zo9qq0pntkrt2vnczoxx7hfsl6k233zy ).

@vibhatha
Collaborator

vibhatha commented Jan 3, 2023

take

@vibhatha
Collaborator

vibhatha commented Jan 3, 2023


I think it would be a good addition to add record_batch_reader_source.

That would be my preferred solution. I'd rather not maintain that logic in the R package, and the same need has come up on the mailing list in a context unrelated to the R package ( https://lists.apache.org/thread/zo9qq0pntkrt2vnczoxx7hfsl6k233zy ).

@paleolimbot The reference link here which refers to this code block is outdated AFAIU.

@paleolimbot
Member Author

Yes, sorry! The block was this one:

std::shared_ptr<compute::ExecNode> ExecNode_SourceNode(
    const std::shared_ptr<compute::ExecPlan>& plan,
    const std::shared_ptr<arrow::RecordBatchReader>& reader) {
  arrow::compute::SourceNodeOptions options{
      /*output_schema=*/reader->schema(),
      /*generator=*/ValueOrStop(compute::MakeReaderGenerator(
          reader, arrow::io::internal::GetIOThreadPool()))};
  return MakeExecNodeOrStop("source", plan.get(), {}, options);
}

@vibhatha
Collaborator

vibhatha commented Jan 3, 2023

Thanks @paleolimbot, I will work on this.

@vibhatha
Collaborator

vibhatha commented Jan 4, 2023

@westonpace @paleolimbot I created a draft PR to use a RecordBatchReader directly as the data source. I would appreciate your reviews to see if it addresses the problem as expected.

westonpace pushed a commit that referenced this issue Jan 12, 2023
…R API (#15183)

This PR adds the factory `record_batch_reader_source` to Acero. This is a source node that takes a `RecordBatchReader` as the data source along with an executor, which gives the freedom to choose the thread pool used for execution. An example also shows how this can be used in Acero.

- [x] Self-review
* Closes: #15151

Lead-authored-by: vibhatha <vibhatha@gmail.com>
Co-authored-by: Vibhatha Lakmal Abeykoon <vibhatha@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
westonpace added this to the 11.0.0 milestone Jan 12, 2023