Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Optimize IPC stream reading #25514

Open
asfimport opened this issue Jul 13, 2020 · 4 comments
Open

[C++] Optimize IPC stream reading #25514

asfimport opened this issue Jul 13, 2020 · 4 comments

Comments

@asfimport
Copy link

Based on perf reports, more time is spent manipulating C++ data structures than reconstructing record batches from IPC messages, which strikes me as not what we want

here is from a perf report based on the Python code

for i in range(100):
    pa.ipc.open_stream('nyctaxi.arrow').read_all()
-   50.40%     0.06%  python           libarrow.so.100.0.0                  [.] arrow::RecordBatchReader::ReadAll
   - 50.34% arrow::RecordBatchReader::ReadAll     
      - 25.86% arrow::Table::FromRecordBatches    
         - 18.41% arrow::SimpleRecordBatch::column
            - 16.00% arrow::MakeArray
               - 10.49% arrow::VisitTypeInline<arrow::internal::ArrayDataWrapper>  
                    7.71% arrow::PrimitiveArray::SetData           
                    1.87% arrow::StringArray::StringArray          
           1.54% __pthread_mutex_lock                              
           0.88% __pthread_mutex_unlock                            
           0.67% std::_Hash_bytes                                  
           0.60% arrow::ChunkedArray::ChunkedArray                 
      - 22.30% arrow::RecordBatchReader::ReadAll                   
         - 22.12% arrow::ipc::RecordBatchStreamReaderImpl::ReadNext
            - 15.91% arrow::ipc::ReadRecordBatchInternal
               - 15.15% arrow::ipc::LoadRecordBatch
                  - 14.45% arrow::ipc::ArrayLoader::Load
                     + 13.15% arrow::VisitTypeInline<arrow::ipc::ArrayLoader>
            + 5.53% arrow::ipc::InputStreamMessageReader::ReadNextMessage 
        1.84% arrow::SimpleRecordBatch::~SimpleRecordBatch

Perhaps ChunkedArray internally should be changed to contain a vector of ArrayData instead of boxed Arrays.

Reporter: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-9441. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Ji Liu / @tianchen92:
I recently worked with a project using Arrow cpp code, i am interested in this issue, could I take this one? :)

@asfimport
Copy link
Author

Wes McKinney / @wesm:
You can work on it yes, but I'm concerned the changes involved may be quite invasive, so let me know what you find out when you look into it

@asfimport
Copy link
Author

Krisztian Szucs / @kszucs:
Postponing it to 4.0

@asfimport
Copy link
Author

Todd Farmer / @toddfarmer:
This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant